Hey @GuyangSong, anything I can do to help you to diagnose?
(raylet, ip=172.24.56.163) [2022-06-17 18:32:17,992 E 73 73] (raylet) agent_manager.cc:136: Not all required Ray dependencies for the runtime_env feature were found. To install the required dependencies, please
run `pip install "ray[default]"`.
Does this error message still appear in your case?
If it appears, can you paste the raylet command line from `ps -ef | grep raylet`?
By the way, you should look at the node `raylet, ip=172.24.56.163` shown in the prefix of the error log, not the node that main.py runs on.
Hey @GuyangSong, there is no such error now. But I cannot reuse the cluster.
Sorry, I cannot see it. Currently the error is the same as in my previous post: Ray k8s cluster, cannot run new task when previous task failed
Do you have any idea what is wrong with it?
Have you set the `runtime_env`? Is the `components` module located in your `working_dir`?
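For reference, a minimal sketch of what I mean (the path is a placeholder for your project root, not something I know about your setup):

```python
import ray

# Minimal sketch: upload the project directory as the working_dir so that
# worker processes on other nodes can import `components` from the uploaded copy.
ray.init(
    address="auto",
    runtime_env={"working_dir": "/path/to/your/project"},  # placeholder path
)
```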
@GuyangSong, for the first run, I did not set it. Then I set it and ran the code.
(pid=gcs_server) [2022-06-23 21:02:30,624 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node 512e4dd976cf969e81ae8b479ad888a40cae2f8a7c89aa76a023f104 for actor 4b60b9fccbcd40a5601000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
(pid=gcs_server) [2022-06-23 21:02:30,633 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node b88ae7a900d5774a398def1c9792c5e6692c2446e85183e00c257b9f for actor 7bc55c1eecaa08f9fa80dbd901000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
(pid=gcs_server) [2022-06-23 21:02:30,642 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node 4b68912ec6784ea01f26a9548bf68221a58faee9afb6ad0dea2dacaa for actor 5497df4a81fac901e1be7ec401000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
(pid=gcs_server) [2022-06-23 21:02:30,652 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node 4b68912ec6784ea01f26a9548bf68221a58faee9afb6ad0dea2dacaa for actor e6aa9f3bfb8cea4db7d08b8401000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
(pid=gcs_server) [2022-06-23 21:02:30,668 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node b88ae7a900d5774a398def1c9792c5e6692c2446e85183e00c257b9f for actor 6e15381ff4d31a633c77974d01000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
(pid=gcs_server) [2022-06-23 21:02:30,684 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node 4b68912ec6784ea01f26a9548bf68221a58faee9afb6ad0dea2dacaa for actor bb6379f2a6cb30dbf408263901000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
[INFO 21:02:30] run_meltingpot Buffer size: 600
[ERROR 21:02:30] pymarl Failed after 0:00:03!
Traceback (most recent calls WITHOUT Sacred internals):
File "/home/me/app/epymarl/src/main.py", line 66, in my_main
run_train_meltingpot(_run, config, _log)
File "/home/me/app/epymarl/src/run_meltingpot.py", line 62, in run
run_sequential(args=args, logger=logger)
File "/home/me/app/epymarl/src/run_meltingpot.py", line 122, in run_sequential
buffer, queue, buffer_queue, ray_ws = create_buffer(args, scheme, groups, env_info, preprocess, logger)
File "/home/me/app/epymarl/src/run_meltingpot.py", line 509, in create_buffer
assert ray.get(buffer.ready.remote())
File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/home/me/miniconda3/lib/python3.9/site-packages/ray/worker.py", line 1765, in get
raise value
ray.exceptions.RuntimeEnvSetupError: The runtime_env failed to be set up.
(raylet) [2022-06-23 21:02:30,763 C 109 109] (raylet) dependency_manager.cc:208: Check failed: task_entry != queued_task_requests_.end() Can't remove dependencies of tasks that are not queued.
(raylet) *** StackTrace Information ***
(raylet) ray::SpdLogMessage::Flush()
(raylet) ray::RayLog::~RayLog()
(raylet) ray::raylet::DependencyManager::RemoveTaskDependencies()
(raylet) ray::raylet::ClusterTaskManager::PoppedWorkerHandler()
(raylet) std::_Function_handler<>::_M_invoke()
(raylet) std::_Function_handler<>::_M_invoke()
(raylet) std::_Function_handler<>::_M_invoke()
(raylet) std::_Function_handler<>::_M_invoke()
(raylet) boost::asio::detail::wait_handler<>::do_complete()
(raylet) boost::asio::detail::scheduler::do_run_one()
(raylet) boost::asio::detail::scheduler::run()
(raylet) boost::asio::io_context::run()
(raylet) main
(raylet) __libc_start_main
(raylet)
(pid=gcs_server) [2022-06-23 21:02:30,709 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node 4b68912ec6784ea01f26a9548bf68221a58faee9afb6ad0dea2dacaa for actor e23a3d41279976876bdce53201000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
(pid=gcs_server) [2022-06-23 21:02:30,734 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node 167498de1e2bf62a2035943e1f85515f74c77677c92fdddc217ae725 for actor 57c6f66434f69b963200f29d01000000(ReplayBufferwithQueue.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
Then, I also launched a new cluster and ran the code. I got the same error.
The `components` module is the code of my project. I used Docker to mount my code.
@GuyangSong, hey, you can see this post for more information: Ray k8s cluster, cannot run new task when previous task failed - #14 by GoingMyWay
I don't think `runtime_env` should be needed, since I use Docker and k8s. In each pod, the Python environment and the code have already been set up.
If your `components` module isn't installed in the Python environment, the `runtime_env` is needed.
For this error, I need to see your dashboard_agent.log and the command line of the raylet process. We need the logs of the node the task was scheduled to, not the node your main script runs on.
Hi, with Docker, why is `runtime_env` still needed? Without `runtime_env`, I can run the code without any errors. However, the cluster cannot be used twice.
Hi, what should I do? Should I paste the output of dashboard_agent.log here?
ModuleNotFoundError: No module named 'components'
It indicates that no module named 'components' was found on your worker nodes, so I think you need `runtime_env`. You can ssh to your worker node and run `python -c "import components"` to check.
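If ssh access is inconvenient, a rough alternative (just a sketch, not guaranteed to hit every node) is to probe importability from inside the cluster with a few small remote tasks:

```python
import ray

ray.init(address="auto")

@ray.remote
def probe():
    # Report whether `components` is importable on whichever node this task lands on.
    try:
        import components
        return "ok: " + str(components.__file__)
    except ImportError as exc:
        return "missing: " + repr(exc)

# Fire several probes; duplicates are fine, we only care about failures.
print(ray.get([probe.remote() for _ in range(10)]))
```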
Yep, we need to take a look at dashboard_agent.log and the output of `ps -ef | grep raylet`.
Hi, @GuyangSong, `components` is a folder in my project. It is already in the workspace. That is why I think requiring `runtime_env` is not a wise design for k8s Ray clusters. All the code is already in the workspace in each pod, so as a k8s Ray cluster user I should not need to install dependencies.
By the way, could you please also read my comments in this post? They provide a lot of useful information. I have already made many attempts to debug the cluster.
This reply shows how the cluster was created with helm and how to launch a new cluster.
Although your project is in the workspace, you need to ensure that the Python environment (the Ray worker processes) can find your package. If it cannot, you must set the `runtime_env`.
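If you really want to avoid uploading anything because the code is already mounted in every pod, one possible workaround (a sketch, assuming the path from your traceback is the mounted project root) is to extend PYTHONPATH for the workers via `runtime_env`:

```python
import ray

# Sketch: instead of uploading a working_dir, point worker processes at the
# directory where the mounted project code lives. The path is an assumption
# taken from the traceback; replace it with your actual mount point.
ray.init(
    address="auto",
    runtime_env={"env_vars": {"PYTHONPATH": "/home/me/app/epymarl/src"}},
)
```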
@yic @architkulkarni wdyt?
I am curious why Ray can find the code and the packages and run without error the first time, but fails the second time.
Maybe, but I think it is weird.