Raylet error: some workers have not registered within the timeout

Hi @Chen_Shen, I have uploaded the full log folder to zyc-bit/raylog: ray logs (github.com)
From what I can find, there is no python-core-worker-*_{pid}.log whose pid matches the workers that failed to start. The PIDs in the log file names differ from those of the failed worker processes.
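For anyone repeating this check, here is a minimal sketch of it in Python. It assumes the logs are in the default session directory /tmp/ray/session_latest/logs (adjust if a custom temp dir is used). The helper name find_worker_logs and the PIDs in the usage part are hypothetical, not taken from the uploaded logs.

```python
from pathlib import Path


def find_worker_logs(log_dir, pids):
    """Map each PID to the core-worker log files whose name ends in _<pid>.log."""
    log_dir = Path(log_dir)
    return {
        pid: sorted(
            p.name for p in log_dir.glob(f"python-core-worker-*_{pid}.log")
        )
        for pid in pids
    }


if __name__ == "__main__":
    # Assumed default log location; the PIDs below are placeholders for the
    # PIDs reported in the raylet error message.
    result = find_worker_logs("/tmp/ray/session_latest/logs", [12345, 12346])
    for pid, files in result.items():
        print(pid, files if files else "no core-worker log found")
```

An empty list for a PID means that worker never got far enough to create its own log file, which is consistent with it failing before registration.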

The clusters have been running for a very long time, but Slurm was upgraded and the machines were rebooted last week. I still get the raylet error above, so maybe it is not an OOM error.
I uploaded the logs from a run I just did here: zyc-bit/raylog_new (github.com)

Hi, Chen_Shen
Thank you very much for always helping me. From the two log folders I uploaded, can you see the cause of the error?

hi @zyc-bit, it's a bit odd that you are hitting this issue and there is no obvious error in the logs. Since you are on an earlier version of Ray, it could be that the worker processes took longer to start:

  • either change the config worker_register_timeout_seconds to 60 seconds and see if that resolves the issue. You can set the env variable RAY_worker_register_timeout_seconds=60 in your ray start command.
  • or use the latest Ray, 1.13.0, where worker_register_timeout_seconds defaults to 60.
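If the head node is launched from a script (for example inside a Slurm job), the first option can be sketched in Python as below. This is only an illustration: the helper names ray_start_env and start_head are hypothetical, and the only Ray-specific pieces are the env variable and the ray start --head command mentioned above.

```python
import os
import subprocess


def ray_start_env(timeout_seconds, base_env=None):
    """Return a copy of the environment with the worker register timeout set."""
    env = dict(os.environ if base_env is None else base_env)
    # Env variables must be strings; Ray reads this one at startup.
    env["RAY_worker_register_timeout_seconds"] = str(timeout_seconds)
    return env


def start_head(timeout_seconds=60):
    """Run `ray start --head` with the timeout applied (requires ray on PATH)."""
    return subprocess.run(
        ["ray", "start", "--head"],
        env=ray_start_env(timeout_seconds),
        check=True,
    )
```

Calling start_head(60) is then equivalent to prefixing the command with the variable on the shell command line.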

Well, I have updated Ray to 1.13.0 and still get the error. Actually, the errors above are produced by both Ray 1.12.1 and 1.13.0.
Can I set the env variable RAY_worker_register_timeout_seconds to something bigger than 60?

@zyc-bit sure, you can try setting it to 90 or 120 seconds and see if it solves your problem.

I set it to 120 with
RAY_worker_register_timeout_seconds=120 ray start --head
and the problem was not solved. :frowning:

Looks like some pair programming might help here.

I will categorize this as Ray Core for now FYI @Chen_Shen .

yeah @zyc-bit I think the best way to resolve it is for me to reproduce your error with some scripts, or we do some pair debugging. let me know which works for you the best.

Hi, Chen_Shen. Late last night I set RAY_worker_register_timeout_seconds=600 and it worked, but the dashboard still failed to start, so there are still other errors.
So the (raylet) error can be worked around by setting RAY_worker_register_timeout_seconds to a very large number.
As for the dashboard failing to start: there are no special scripts, I just run ray start --head.
And I'd like to do some pair debugging with you.

hi @zyc-bit sounds great, I'll send you a message.

I am facing the same problem. What was the solution?

hello? Have you solved it?