Hi @Chen_Shen, I have uploaded the full log folder to zyc-bit/raylog: ray logs (github.com)
According to my search, there is no python-core-worker-*_{pid}.log whose pid matches the workers that failed to start; the pids in the log filenames are different from those of the failed worker processes.
The cluster has been running for a very long time, though it had a Slurm upgrade and a reboot last week. I still get the raylet error above, so maybe it is not an OOM error.
I uploaded the logs from the run I just did here: zyc-bit/raylog_new (github.com)
Hi @zyc-bit, it's a bit odd that you are hitting this issue with no obvious error in the logs. Since you are on an earlier version of Ray, it could be that the worker processes took longer to start than the register timeout allows:
Either raise the config worker_register_timeout_seconds to 60 seconds to see if that resolves the issue; you can set the env variable RAY_worker_register_timeout_seconds=60 in your ray start command.
Or use the latest Ray, 1.13.0, where worker_register_timeout_seconds defaults to 60.
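For reference, one way to apply the env-variable override is to export it in the shell before launching Ray. This is only a sketch: the 60-second value is illustrative, and the `ray start --head` invocation is commented out since it assumes Ray is installed on the node.

```shell
# Raise the worker register timeout (value in seconds; 60 is illustrative).
export RAY_worker_register_timeout_seconds=60

# Confirm the variable is visible to child processes before starting Ray.
echo "$RAY_worker_register_timeout_seconds"   # prints 60

# Then start the head node as usual, e.g.:
#   ray start --head
```

Setting the variable inline (`RAY_worker_register_timeout_seconds=60 ray start --head`) works the same way for a single command.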
Well, I have updated my Ray to 1.13.0 and still get the error. Actually, the errors above are produced by both Ray 1.12.1 and 1.13.0.
Can I set the env variable RAY_worker_register_timeout_seconds to a value larger than 60?
Yeah @zyc-bit, I think the best way to resolve this is for me to reproduce your error with some scripts, or for us to do some pair debugging. Let me know which works best for you.
Hi @Chen_Shen. Late last night I set RAY_worker_register_timeout_seconds=600 and it worked, but the dashboard still failed to start, so there are other errors involved.
So the (raylet) error can be solved by setting RAY_worker_register_timeout_seconds to a very large value.
As for the dashboard failing to start: there were no special scripts, I just ran ray start --head.
And I'd like to do some pair debugging with you.