[Core] Ray.init() hanging

Hello,

I have a training script that runs well locally on my machine, but not reliably when I deploy it on an online machine (just on one node, not in cluster/autoscaler mode). After some debugging, it seems that I don’t get past the call to ray.init(), although I don’t get any explicit error either. Sometimes it works, sometimes it works only after a long time, but often the initialisation never finishes.

I’ve looked at the logs from a failed run in /tmp/ray/{expid}/logs and I’m seeing some “Failed to register worker 135fa48518c1e9860c4f8ddae75d2dce6cee220df8b4882ed8e1b9f3 to Raylet. Invalid: Invalid: Unknown worker” errors. I’m also getting some warnings, but I’m not sure which are relevant.

Here are all the log files (with cat *): https://pastebin.com/raw/EKaHyiBx. If someone has a clue of where I should look or what I could try to debug this, please let me know!

I’ve also tried passing more options to ray.init, but that didn’t help: ray.init(include_dashboard=False, _temp_dir='/global/scratch/ray_logs', num_cpus=20, object_store_memory=500e6, _memory=500e6)

Hey @nathanlct , sorry for the long wait. This is due to your question not being categorized. Could you make sure that for any new posts you add a category (e.g. “Ray Core” or “RLlib”) to your question? This helps us assign the right person to respond more quickly.

@Stephanie_Wang , could someone from the Ray Core team answer this one here?
Thanks :slight_smile:

Sorry for the late reply. The cause is this line in your logs:

[2021-05-11 08:36:38,834 I 29410 29410] worker_pool.cc:342: Some workers of the worker process(29490) have not registered to raylet within timeout.

The timeout fires first, so the worker is removed from the raylet. When the worker later tries to register, the raylet can no longer find it, which is where the “Unknown worker” error comes from.
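To check whether a given failed run hit this same timeout, you can search the log directory for that message. This is only a rough sketch: the function name is made up for illustration, and the directory you pass in is whatever log folder your run wrote to (for example the /tmp/ray/{expid}/logs folder mentioned above).

```python
from pathlib import Path

# Substring copied from the raylet log line quoted above.
TIMEOUT_MESSAGE = "have not registered to raylet within timeout"


def find_registration_timeouts(log_dir):
    """Return (file name, line) pairs for every log line that mentions the timeout.

    Hypothetical helper for illustration; it is not part of Ray.
    """
    matches = []
    for path in sorted(Path(log_dir).iterdir()):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a log contains odd bytes.
        with path.open(errors="replace") as handle:
            for line in handle:
                if TIMEOUT_MESSAGE in line:
                    matches.append((path.name, line.rstrip("\n")))
    return matches
```

If this returns any lines, the run most likely failed because workers registered too late, rather than because of one of the unrelated warnings.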

You can increase the timeout by setting the RAY_worker_register_timeout_seconds environment variable.
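For example, something like the following should work from inside the training script. It is a minimal sketch and assumes the variable has to be in the environment before ray.init() starts the raylet, so that the raylet process inherits it. The helper name and the value 120 are only illustrative, not recommended settings.

```python
import os

# Variable name as given in the reply above.
TIMEOUT_ENV_VAR = "RAY_worker_register_timeout_seconds"


def set_worker_register_timeout(seconds):
    """Put the timeout into this process's environment.

    Hypothetical helper for illustration. Processes launched afterwards
    (such as the raylet started by ray.init()) inherit the value.
    """
    if seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    os.environ[TIMEOUT_ENV_VAR] = str(int(seconds))


# Usage (requires Ray to be installed); call the helper before ray.init():
#   set_worker_register_timeout(120)  # 120 is an arbitrary example value
#   import ray
#   ray.init(include_dashboard=False)
```

Exporting the variable in the shell before launching the script should have the same effect.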