Hi @Chen_Shen, I have uploaded the full log folder to zyc-bit/raylog: ray logs (github.com)
According to my search, no
pid in the logs matches the workers that failed to start. Every
pid I found is different from those of the failed worker processes.
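A search like this could be scripted roughly as follows. This is only a sketch: `find_pid_mentions` is a hypothetical helper (not part of Ray), and it assumes the logs are plain text files in a local folder and that the PID appears in them as a standalone number.

```python
import re
from pathlib import Path


def find_pid_mentions(log_dir, pid):
    """Return (file, line number, line) for each log line that mentions pid.

    Hypothetical helper for illustration; it walks every file under log_dir.
    """
    # Match the PID only as a whole number, so 12345 does not match 123456.
    pattern = re.compile(rf"(?<!\d){int(pid)}(?!\d)")
    hits = []
    for path in sorted(Path(log_dir).rglob("*")):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a log has odd bytes.
        with path.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if pattern.search(line):
                    hits.append((str(path), number, line.rstrip("\n")))
    return hits
```

An empty list for the PID of a failed worker would mean that worker never wrote anything to the scanned folder under that PID.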
The cluster has been running for a very long time, but Slurm was upgraded and the cluster was rebooted last week. I still get the raylet error above, so maybe it is not an OOM error.
I uploaded the logs from the run I just did here: zyc-bit/raylog_new (github.com)
Thank you very much for always helping me. From the two log folders I uploaded, can you see the cause of the error?
Hi @zyc-bit, it’s a bit odd that you are hitting this issue and there is no obvious error in the logs. Since you are on an earlier version of Ray, it could be that the worker processes took longer to start:
- either you can change the config
worker_register_timeout_seconds to 60 seconds to see if it resolves the issue. You can set the env variable
RAY_worker_register_timeout_seconds=60 in your ray start command.
- or use the latest Ray, 1.13.0, in which
worker_register_timeout_seconds defaults to 60.
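If you launch Ray from a script rather than typing the command by hand, the first option could look roughly like the sketch below. `ray_start_env` and `start_head` are hypothetical helper names for illustration, not Ray APIs; the only Ray-specific parts are the `RAY_worker_register_timeout_seconds` variable and the `ray start --head` command mentioned above.

```python
import os
import subprocess


def ray_start_env(timeout_seconds, base_env=None):
    """Return a copy of the environment with the register timeout set.

    Hypothetical helper: base_env defaults to the current process environment.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Environment variable values must be strings.
    env["RAY_worker_register_timeout_seconds"] = str(timeout_seconds)
    return env


def start_head(timeout_seconds=60):
    """Run `ray start --head` with the timeout override applied."""
    return subprocess.run(
        ["ray", "start", "--head"],
        env=ray_start_env(timeout_seconds),
        check=True,  # raise CalledProcessError if the command fails
    )
```

Building the environment in a separate function keeps the override easy to inspect before anything is launched.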
Well, I have updated Ray to 1.13.0 and still get the error. Actually, the errors above are produced by both Ray 1.12.1 and 1.13.0.
Can I set the env variable
RAY_worker_register_timeout_seconds bigger than 60?
@zyc-bit sure, you can try setting it to 90 or 120 seconds and see if that solves your problem.
I set it to 120 with
RAY_worker_register_timeout_seconds=120 ray start --head
and the problem was not solved.
Looks like some pair programming might help here.
I will categorize this as Ray Core for now FYI @Chen_Shen .
Yeah @zyc-bit, I think the best way to resolve it is for me to reproduce your error with some scripts, or for us to do some pair debugging. Let me know which works best for you.
Hi, Chen_Shen. Late last night, I set
RAY_worker_register_timeout_seconds=600 and it worked, but the dashboard still failed to start, so there are still other errors involved.
So the (raylet) error can be solved by setting
RAY_worker_register_timeout_seconds to a very large number.
As for the dashboard failing to start: there are no special scripts, I just run
ray start --head.
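One quick way to confirm whether the dashboard is actually listening is a plain TCP check like the sketch below. It assumes the dashboard is on Ray's default port 8265 on the head node; `port_is_open` is a hypothetical helper, not a Ray function, and a closed port only shows that nothing is listening, not why.

```python
import socket


def port_is_open(host="127.0.0.1", port=8265, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds.

    Hypothetical helper; 8265 is assumed to be the dashboard port.
    """
    try:
        # create_connection raises OSError on refusal or timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```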
And I’d like to do some pair debugging with you.
Hi @zyc-bit, sounds great, sending you a message.