Hi @Chen_Shen, I have uploaded the full log folder to zyc-bit/raylog: ray logs (github.com)
According to my search, there is no python-core-worker-*_{pid}.log whose pid matches the workers that failed to start; the pids in the log filenames are different from those of the failed worker processes.
The cluster has been running for a very long time, though it had a Slurm upgrade and a reboot last week. I still get the raylet error above, so maybe it is not an OOM error.
I uploaded the logs from the run I just did here: zyc-bit/raylog_new (github.com)
Hi @zyc-bit, it's a bit odd that you are hitting this issue with no obvious error in the logs. Since you are on an earlier version of Ray, it could be that the worker processes took longer to start than the register timeout allows:
Either raise the config worker_register_timeout_seconds to 60 seconds to see if that resolves the issue; you can set the env variable RAY_worker_register_timeout_seconds=60 in your ray start command.
Or use the latest Ray, 1.13.0, where worker_register_timeout_seconds defaults to 60.
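For reference, one way to apply the env-variable override is to export it in the shell before launching Ray. This is only a sketch: the 60-second value is illustrative, and the `ray start --head` invocation is commented out since it assumes Ray is installed on the node.

```shell
# Raise the worker register timeout (value in seconds; 60 is illustrative).
export RAY_worker_register_timeout_seconds=60

# Confirm the variable is visible to child processes before starting Ray.
echo "$RAY_worker_register_timeout_seconds"   # prints 60

# Then start the head node as usual, e.g.:
#   ray start --head
```

Setting the variable inline (`RAY_worker_register_timeout_seconds=60 ray start --head`) works the same way for a single command.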
Well, I have updated my Ray to 1.13.0 and still get the error. Actually, the errors above are produced by both Ray 1.12.1 and 1.13.0.
Can I set the env variable RAY_worker_register_timeout_seconds to a value larger than 60?
Yeah @zyc-bit, I think the best way to resolve this is for me to reproduce your error with some scripts, or for us to do some pair debugging. Let me know which works best for you.
Hi @Chen_Shen. Late last night I set RAY_worker_register_timeout_seconds=600 and it worked, but the dashboard still failed to start, so there are other errors involved.
So the (raylet) error can be solved by setting RAY_worker_register_timeout_seconds to a very large value.
As for the dashboard failing to start: there were no special scripts, I just ran ray start --head.
And I'd like to do some pair debugging with you.