Ray k8s cluster, cannot run new task when previous task failed

Hey @GuyangSong, anything I can do to help you to diagnose?

(raylet, ip=172.24.56.163) [2022-06-17 18:32:17,992 E 73 73] (raylet) agent_manager.cc:136: Not all required Ray dependencies for the runtime_env feature were found. To install the required dependencies, please run `pip install "ray[default]"`.

Does this error message still appear in your case?

If it appears, can you paste the command line of the raylet process from `ps -ef | grep raylet`?

By the way, you should check the node raylet, ip=172.24.56.163, which is shown in the prefix of the error log, not the node that main.py runs on.

Hey @GuyangSong, there is no such error now. But I cannot reuse the cluster.

Sorry, I cannot see it. Currently the error is the same as in my previous post: Ray k8s cluster, cannot run new task when previous task failed

Do you have any idea what is wrong with it?

Have you set the runtime_env? Is the components module located in your working_dir?

@GuyangSong, For the first run, I did not set it. Then, I set it and ran the code.

(pid=gcs_server) [2022-06-23 21:02:30,624 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node 512e4dd976cf969e81ae8b479ad888a40cae2f8a7c89aa76a023f104 for actor 4b60b9fccbcd40a5601000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
(pid=gcs_server) [2022-06-23 21:02:30,633 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node b88ae7a900d5774a398def1c9792c5e6692c2446e85183e00c257b9f for actor 7bc55c1eecaa08f9fa80dbd901000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
(pid=gcs_server) [2022-06-23 21:02:30,642 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node 4b68912ec6784ea01f26a9548bf68221a58faee9afb6ad0dea2dacaa for actor 5497df4a81fac901e1be7ec401000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
(pid=gcs_server) [2022-06-23 21:02:30,652 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node 4b68912ec6784ea01f26a9548bf68221a58faee9afb6ad0dea2dacaa for actor e6aa9f3bfb8cea4db7d08b8401000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
(pid=gcs_server) [2022-06-23 21:02:30,668 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node b88ae7a900d5774a398def1c9792c5e6692c2446e85183e00c257b9f for actor 6e15381ff4d31a633c77974d01000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
(pid=gcs_server) [2022-06-23 21:02:30,684 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node 4b68912ec6784ea01f26a9548bf68221a58faee9afb6ad0dea2dacaa for actor bb6379f2a6cb30dbf408263901000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
[INFO 21:02:30] run_meltingpot Buffer size: 600
[ERROR 21:02:30] pymarl Failed after 0:00:03!
Traceback (most recent calls WITHOUT Sacred internals):
  File "/home/me/app/epymarl/src/main.py", line 66, in my_main
    run_train_meltingpot(_run, config, _log)
  File "/home/me/app/epymarl/src/run_meltingpot.py", line 62, in run
    run_sequential(args=args, logger=logger)
  File "/home/me/app/epymarl/src/run_meltingpot.py", line 122, in run_sequential
    buffer, queue, buffer_queue, ray_ws = create_buffer(args, scheme, groups, env_info, preprocess, logger)
  File "/home/me/app/epymarl/src/run_meltingpot.py", line 509, in create_buffer
    assert ray.get(buffer.ready.remote())
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/worker.py", line 1765, in get
    raise value
ray.exceptions.RuntimeEnvSetupError: The runtime_env failed to be set up.

(raylet) [2022-06-23 21:02:30,763 C 109 109] (raylet) dependency_manager.cc:208:  Check failed: task_entry != queued_task_requests_.end() Can't remove dependencies of tasks that are not queued.
(raylet) *** StackTrace Information ***
(raylet)     ray::SpdLogMessage::Flush()
(raylet)     ray::RayLog::~RayLog()
(raylet)     ray::raylet::DependencyManager::RemoveTaskDependencies()
(raylet)     ray::raylet::ClusterTaskManager::PoppedWorkerHandler()
(raylet)     std::_Function_handler<>::_M_invoke()
(raylet)     std::_Function_handler<>::_M_invoke()
(raylet)     std::_Function_handler<>::_M_invoke()
(raylet)     std::_Function_handler<>::_M_invoke()
(raylet)     boost::asio::detail::wait_handler<>::do_complete()
(raylet)     boost::asio::detail::scheduler::do_run_one()
(raylet)     boost::asio::detail::scheduler::run()
(raylet)     boost::asio::io_context::run()
(raylet)     main
(raylet)     __libc_start_main
(raylet) 
(pid=gcs_server) [2022-06-23 21:02:30,709 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node 4b68912ec6784ea01f26a9548bf68221a58faee9afb6ad0dea2dacaa for actor e23a3d41279976876bdce53201000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
(pid=gcs_server) [2022-06-23 21:02:30,734 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node 167498de1e2bf62a2035943e1f85515f74c77677c92fdddc217ae725 for actor 57c6f66434f69b963200f29d01000000(ReplayBufferwithQueue.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED

Then, I also launched a new cluster and ran the code. I got the same error.

components is part of my project's code. I used Docker to mount my code.

@GuyangSong, hey, you can see this post for more information: Ray k8s cluster, cannot run new task when previous task failed - #14 by GoingMyWay

I don't think runtime_env is needed with Docker, as I use Docker and k8s. In each pod, the python env and the code have already been set up.

If your components module isn’t installed in the python environment, the runtime_env is needed.
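For reference, a minimal sketch of such a runtime_env, following Ray's documented schema; the paths here are illustrative assumptions, not taken from this cluster:

```python
# A minimal runtime_env sketch: ship the project directory that contains
# the `components` package to every node in the cluster.
runtime_env = {
    "working_dir": ".",  # uploaded to all workers, so imports resolve there
    # Alternative: ship only the package itself.
    # "py_modules": ["./components"],
}

# It would then be passed when the driver starts, e.g.:
#   import ray
#   ray.init(runtime_env=runtime_env)
print(sorted(runtime_env.keys()))
```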

For this error, I need to see your dashboard_agent.log and the command line of the raylet process. We need the logs of the node that the task was scheduled to, not the node your main script runs on.

Hi, with Docker, why is runtime_env still needed? Without runtime_env, I can run the code without any errors; however, the cluster cannot be used twice.

Hi, what should I do? Should I paste the output of dashboard_agent.log here?

ModuleNotFoundError: No module named 'components'  

It indicates that no module named 'components' was found on your worker nodes, so I think you need runtime_env. You can ssh to a worker node and run `python -c "import components"` to check.
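The same check can be done programmatically without actually running the import; a small sketch (the project-specific `components` name is taken from the error above):

```python
import importlib.util

def module_available(name: str) -> bool:
    """Return True if `name` is importable in the current python environment."""
    return importlib.util.find_spec(name) is not None

# On a worker node whose sys.path does not include the mounted workspace,
# the project-specific `components` package would not be found:
print(module_available("components"))
print(module_available("os"))  # stdlib modules are always importable
```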

Yep, we need to take a look at dashboard_agent.log and the output of `ps -ef | grep raylet`

Hi, @GuyangSong, components is a folder in my project. It is already in the workspace. That is why I think requiring runtime_env is not a wise design pattern for k8s Ray clusters. All the code is in the workspace in each pod; as a k8s Ray cluster user, I actually do not need to install dependencies.

By the way, could you please also read my comments in this post? They provide a lot of useful information. I have already made many attempts to debug the cluster.

This reply shows how the cluster is created with helm and how to launch a new cluster.

Although your project is in the workspace, you need to ensure the python environment (the ray process) can find your package. If not, you must set the runtime_env.
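One way to make the package findable without runtime_env is to put the mounted workspace on the import path inside the pod. A sketch, where the path is an assumption based on the traceback above:

```python
import os
import sys

# Assumed location of the mounted workspace that contains `components`
# (taken from the traceback paths earlier in this thread).
project_root = "/home/me/app/epymarl/src"

# Make child processes started from this environment inherit the path
# via PYTHONPATH...
os.environ["PYTHONPATH"] = (
    project_root + os.pathsep + os.environ.get("PYTHONPATH", "")
)

# ...and extend sys.path for the current process as well.
if project_root not in sys.path:
    sys.path.insert(0, project_root)
```

In a k8s setup this would typically be done once in the image or pod spec instead, e.g. an `ENV PYTHONPATH=...` line in the Dockerfile or an `env` entry for the container, so that every raylet and worker process sees it.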

@yic @architkulkarni wdyt?

I am curious: when I run the code the first time, why can Ray find the code and the packages and run without error, while the second time it fails?

Maybe the reason is what @yic commented here?

Maybe, but I think it is weird.