Ray k8s cluster, cannot run new task when previous task failed

Hey @GuyangSong, anything I can do to help you to diagnose?

(raylet, ip=172.24.56.163) [2022-06-17 18:32:17,992 E 73 73] (raylet) agent_manager.cc:136: Not all required Ray dependencies for the runtime_env feature were found. To install the required dependencies, please run `pip install "ray[default]"`.

Does this error message still appear in your case?

If it appears, can you paste the command line of the raylet process from `ps -ef | grep raylet`?

By the way, you should check the node raylet, ip=172.24.56.163, which is shown in the prefix of the error log, not the node that main.py runs on.

Hey @GuyangSong, there is no such error now. But I cannot reuse the cluster.

Sorry, I cannot see it. Currently the error is the same as in my previous post: Ray k8s cluster, cannot run new task when previous task failed

Do you have any idea what is wrong with it?

Have you set the runtime_env? Is the components module located in your working_dir?

@GuyangSong, For the first run, I did not set it. Then, I set it and ran the code.

(pid=gcs_server) [2022-06-23 21:02:30,624 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node 512e4dd976cf969e81ae8b479ad888a40cae2f8a7c89aa76a023f104 for actor 4b60b9fccbcd40a5601000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
(pid=gcs_server) [2022-06-23 21:02:30,633 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node b88ae7a900d5774a398def1c9792c5e6692c2446e85183e00c257b9f for actor 7bc55c1eecaa08f9fa80dbd901000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
(pid=gcs_server) [2022-06-23 21:02:30,642 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node 4b68912ec6784ea01f26a9548bf68221a58faee9afb6ad0dea2dacaa for actor 5497df4a81fac901e1be7ec401000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
(pid=gcs_server) [2022-06-23 21:02:30,652 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node 4b68912ec6784ea01f26a9548bf68221a58faee9afb6ad0dea2dacaa for actor e6aa9f3bfb8cea4db7d08b8401000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
(pid=gcs_server) [2022-06-23 21:02:30,668 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node b88ae7a900d5774a398def1c9792c5e6692c2446e85183e00c257b9f for actor 6e15381ff4d31a633c77974d01000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
(pid=gcs_server) [2022-06-23 21:02:30,684 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node 4b68912ec6784ea01f26a9548bf68221a58faee9afb6ad0dea2dacaa for actor bb6379f2a6cb30dbf408263901000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
[INFO 21:02:30] run_meltingpot Buffer size: 600
[ERROR 21:02:30] pymarl Failed after 0:00:03!
Traceback (most recent calls WITHOUT Sacred internals):
  File "/home/me/app/epymarl/src/main.py", line 66, in my_main
    run_train_meltingpot(_run, config, _log)
  File "/home/me/app/epymarl/src/run_meltingpot.py", line 62, in run
    run_sequential(args=args, logger=logger)
  File "/home/me/app/epymarl/src/run_meltingpot.py", line 122, in run_sequential
    buffer, queue, buffer_queue, ray_ws = create_buffer(args, scheme, groups, env_info, preprocess, logger)
  File "/home/me/app/epymarl/src/run_meltingpot.py", line 509, in create_buffer
    assert ray.get(buffer.ready.remote())
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/worker.py", line 1765, in get
    raise value
ray.exceptions.RuntimeEnvSetupError: The runtime_env failed to be set up.

(raylet) [2022-06-23 21:02:30,763 C 109 109] (raylet) dependency_manager.cc:208:  Check failed: task_entry != queued_task_requests_.end() Can't remove dependencies of tasks that are not queued.
(raylet) *** StackTrace Information ***
(raylet)     ray::SpdLogMessage::Flush()
(raylet)     ray::RayLog::~RayLog()
(raylet)     ray::raylet::DependencyManager::RemoveTaskDependencies()
(raylet)     ray::raylet::ClusterTaskManager::PoppedWorkerHandler()
(raylet)     std::_Function_handler<>::_M_invoke()
(raylet)     std::_Function_handler<>::_M_invoke()
(raylet)     std::_Function_handler<>::_M_invoke()
(raylet)     std::_Function_handler<>::_M_invoke()
(raylet)     boost::asio::detail::wait_handler<>::do_complete()
(raylet)     boost::asio::detail::scheduler::do_run_one()
(raylet)     boost::asio::detail::scheduler::run()
(raylet)     boost::asio::io_context::run()
(raylet)     main
(raylet)     __libc_start_main
(raylet) 
(pid=gcs_server) [2022-06-23 21:02:30,709 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node 4b68912ec6784ea01f26a9548bf68221a58faee9afb6ad0dea2dacaa for actor e23a3d41279976876bdce53201000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
(pid=gcs_server) [2022-06-23 21:02:30,734 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node 167498de1e2bf62a2035943e1f85515f74c77677c92fdddc217ae725 for actor 57c6f66434f69b963200f29d01000000(ReplayBufferwithQueue.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED

Then, I also launched a new cluster and ran the code. I got the same error.

components is part of my project's code. I used Docker to mount my code.

@GuyangSong, hey, you can see this post for more information: Ray k8s cluster, cannot run new task when previous task failed - #14 by GoingMyWay

I don't think runtime_env is needed with Docker, as I use Docker and k8s. In each pod, the python env and the code have already been set up.

If your components module isn’t installed in the python environment, the runtime_env is needed.
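For reference, a minimal sketch of such a runtime_env, following Ray's documented schema; the paths here are illustrative assumptions, not taken from this cluster:

```python
# A minimal runtime_env sketch: ship the project directory that contains
# the `components` package to every node in the cluster.
runtime_env = {
    "working_dir": ".",  # uploaded to all workers, so imports resolve there
    # Alternative: ship only the package itself.
    # "py_modules": ["./components"],
}

# It would then be passed when the driver starts, e.g.:
#   import ray
#   ray.init(runtime_env=runtime_env)
print(sorted(runtime_env.keys()))
```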

For this error, I need to see your dashboard_agent.log and the command line of the raylet process. We need the logs of the node that the task was scheduled to, not the node your main script runs on.

Hi, with Docker, why is runtime_env still needed? Without runtime_env, I can run the code without any errors; however, the cluster cannot be used twice.

Hi, what should I do? Should I paste the output of dashboard_agent.log here?

ModuleNotFoundError: No module named 'components'  

It indicates that no module named 'components' was found on your worker nodes, so I think you need runtime_env. You can ssh to a worker node and run `python -c "import components"` to check.
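The same check can be done programmatically without actually running the import; a small sketch (the project-specific `components` name is taken from the error above):

```python
import importlib.util

def module_available(name: str) -> bool:
    """Return True if `name` is importable in the current python environment."""
    return importlib.util.find_spec(name) is not None

# On a worker node whose sys.path does not include the mounted workspace,
# the project-specific `components` package would not be found:
print(module_available("components"))
print(module_available("os"))  # stdlib modules are always importable
```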

Yep, we need to take a look at dashboard_agent.log and the output of `ps -ef | grep raylet`

Hi, @GuyangSong, components is a folder in my project. It is already in the workspace. That is why I think requiring runtime_env is not a wise design pattern for k8s Ray clusters. All the code is in the workspace in each pod; as a k8s Ray cluster user, I actually do not need to install dependencies.

By the way, could you please also read my comments in this post? They provide a lot of useful information. I have already made many attempts to debug the cluster.

This reply shows how the cluster is created with helm and how to launch a new cluster.

Although your project is in the workspace, you need to ensure the python environment (the ray process) can find your package. If not, you must set the runtime_env.
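One way to make the package findable without runtime_env is to put the mounted workspace on the import path inside the pod. A sketch, where the path is an assumption based on the traceback above:

```python
import os
import sys

# Assumed location of the mounted workspace that contains `components`
# (taken from the traceback paths earlier in this thread).
project_root = "/home/me/app/epymarl/src"

# Make child processes started from this environment inherit the path
# via PYTHONPATH...
os.environ["PYTHONPATH"] = (
    project_root + os.pathsep + os.environ.get("PYTHONPATH", "")
)

# ...and extend sys.path for the current process as well.
if project_root not in sys.path:
    sys.path.insert(0, project_root)
```

In a k8s setup this would typically be done once in the image or pod spec instead, e.g. an `ENV PYTHONPATH=...` line in the Dockerfile or an `env` entry for the container, so that every raylet and worker process sees it.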

@yic @architkulkarni wdyt?

I am curious: when I run the code the first time, why can Ray find the code and the packages and run without error, while the second time it fails?

Maybe the reason is what @yic commented here?

Maybe, but I think it is weird.