Ray k8s cluster, cannot run new task when previous task failed

@GuyangSong, for the first run, I did not set it. Then I set it and ran the code.

(pid=gcs_server) [2022-06-23 21:02:30,624 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node 512e4dd976cf969e81ae8b479ad888a40cae2f8a7c89aa76a023f104 for actor 4b60b9fcc[0/269]bcd40a5601000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
(pid=gcs_server) [2022-06-23 21:02:30,633 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node b88ae7a900d5774a398def1c9792c5e6692c2446e85183e00c257b9f for actor 7bc55c1eecaa08f9fa80dbd901000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
(pid=gcs_server) [2022-06-23 21:02:30,642 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node 4b68912ec6784ea01f26a9548bf68221a58faee9afb6ad0dea2dacaa for actor 5497df4a81fac901e1be7ec401000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
(pid=gcs_server) [2022-06-23 21:02:30,652 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node 4b68912ec6784ea01f26a9548bf68221a58faee9afb6ad0dea2dacaa for actor e6aa9f3bfb8cea4db7d08b8401000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
(pid=gcs_server) [2022-06-23 21:02:30,668 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node b88ae7a900d5774a398def1c9792c5e6692c2446e85183e00c257b9f for actor 6e15381ff4d31a633c77974d01000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
(pid=gcs_server) [2022-06-23 21:02:30,684 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node 4b68912ec6784ea01f26a9548bf68221a58faee9afb6ad0dea2dacaa for actor bb6379f2a6cb30dbf408263901000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
[INFO 21:02:30] run_meltingpot Buffer size: 600
[ERROR 21:02:30] pymarl Failed after 0:00:03!
Traceback (most recent calls WITHOUT Sacred internals):
  File "/home/me/app/epymarl/src/main.py", line 66, in my_main
    run_train_meltingpot(_run, config, _log)
  File "/home/me/app/epymarl/src/run_meltingpot.py", line 62, in run
    run_sequential(args=args, logger=logger)
  File "/home/me/app/epymarl/src/run_meltingpot.py", line 122, in run_sequential
    buffer, queue, buffer_queue, ray_ws = create_buffer(args, scheme, groups, env_info, preprocess, logger)
  File "/home/me/app/epymarl/src/run_meltingpot.py", line 509, in create_buffer
    assert ray.get(buffer.ready.remote())
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/worker.py", line 1765, in get
    raise value
ray.exceptions.RuntimeEnvSetupError: The runtime_env failed to be set up.

(raylet) [2022-06-23 21:02:30,763 C 109 109] (raylet) dependency_manager.cc:208:  Check failed: task_entry != queued_task_requests_.end() Can't remove dependencies of tasks that are not queued.
(raylet) *** StackTrace Information ***
(raylet)     ray::SpdLogMessage::Flush()
(raylet)     ray::RayLog::~RayLog()
(raylet)     ray::raylet::DependencyManager::RemoveTaskDependencies()
(raylet)     ray::raylet::ClusterTaskManager::PoppedWorkerHandler()
(raylet)     std::_Function_handler<>::_M_invoke()
(raylet)     std::_Function_handler<>::_M_invoke()
(raylet)     std::_Function_handler<>::_M_invoke()
(raylet)     std::_Function_handler<>::_M_invoke()
(raylet)     boost::asio::detail::wait_handler<>::do_complete()
(raylet)     boost::asio::detail::scheduler::do_run_one()
(raylet)     boost::asio::detail::scheduler::run()
(raylet)     boost::asio::io_context::run()
(raylet)     main
(raylet)     __libc_start_main
(raylet) 
(pid=gcs_server) [2022-06-23 21:02:30,709 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node 4b68912ec6784ea01f26a9548bf68221a58faee9afb6ad0dea2dacaa for actor e23a3d41279976876bdce53201000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
(pid=gcs_server) [2022-06-23 21:02:30,734 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node 167498de1e2bf62a2035943e1f85515f74c77677c92fdddc217ae725 for actor 57c6f66434f69b963200f29d01000000(ReplayBufferwithQueue.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED

Then I also launched a new cluster and ran the code, and I got the same error.