[Core] Ray.init() hanging

nathanlct · May 11, 2021, 9:53pm

Hello,

I have a training script that runs well locally on my machine, but not always when I deploy it on an online machine (just on one node, not in cluster/autoscaler mode). After doing some debugging, it seems that I don’t get past the call to ray.init(), although I don’t get any explicit error either. Sometimes it works, sometimes it works after some (long) time, but often the initialisation never ends.

I’ve looked at the logs from a failed run in /tmp/ray/{expid}/logs and I’m seeing some errors like "Failed to register worker 135fa48518c1e9860c4f8ddae75d2dce6cee220df8b4882ed8e1b9f3 to Raylet. Invalid: Invalid: Unknown worker". I’m also getting some warnings, but I’m not sure which are relevant.

Here are all the log files (with cat *): https://pastebin.com/raw/EKaHyiBx. If someone has a clue of where I should look or what I could try to debug this, please let me know!
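For anyone checking their own logs for the same symptom, here is a minimal sketch of a scan over a Ray log directory. The helper name find_registration_errors is hypothetical and not part of Ray; the two search strings are copied from the messages quoted in this thread and may be worded differently in other Ray versions.

```python
from pathlib import Path

# Substrings copied from the messages quoted in this thread.
# Assumption: other Ray versions may word these messages differently.
PATTERNS = (
    "Failed to register worker",
    "have not registered to raylet within timeout",
)


def find_registration_errors(log_dir):
    """Return (file name, line number, line text) for every matching log line."""
    hits = []
    for path in sorted(Path(log_dir).iterdir()):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a log has odd bytes in it.
        with path.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if any(pattern in line for pattern in PATTERNS):
                    hits.append((path.name, number, line.rstrip()))
    return hits
```

Pointing it at the logs directory of the failing session (the /tmp/ray/{expid}/logs path mentioned above) and printing the result should show which files contain the registration failures and in what order they appear.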

I’ve also tried passing more options to ray.init, without success: ray.init(include_dashboard=False, _temp_dir='/global/scratch/ray_logs', num_cpus=20, object_store_memory=500e6, _memory=500e6).

sven1977 · June 3, 2021, 2:39pm

Hey @nathanlct , sorry for the long wait. This is due to your question not being categorized. Could you make sure that for any new posts you add a category (e.g. “Ray Core” or “RLlib”) to your question? This helps us assign the right person to respond more quickly.

@Stephanie_Wang , could someone from the Ray Core team answer this one here?
Thanks

jjyao · November 24, 2021, 5:17pm

Sorry for the late reply. The cause is this line in your logs:

[2021-05-11 08:36:38,834 I 29410 29410] worker_pool.cc:342: Some workers of the worker process(29490) have not registered to raylet within timeout.

The timeout fires first, so the raylet removes the worker from its list. When the worker later tries to register, the raylet can no longer find it, which produces the "Unknown worker" error.

You can increase the timeout by setting the RAY_worker_register_timeout_seconds environment variable.
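A minimal sketch of doing this from Python on a single node, assuming ray.init() starts the raylet itself so that the raylet inherits the environment of the calling process. The helper names and the 120 second value are illustrative only, not a recommended setting.

```python
import os


def set_worker_register_timeout(seconds):
    """Put the timeout into this process's environment as a string."""
    os.environ["RAY_worker_register_timeout_seconds"] = str(int(seconds))


def start_ray_with_timeout(seconds=120):
    """Set the variable first, then start Ray.

    The variable has to be in the environment before ray.init() launches
    the raylet, because the raylet reads it when it starts.
    """
    set_worker_register_timeout(seconds)
    import ray  # imported here so the sketch can be loaded without Ray installed

    return ray.init()
```

If the script attaches to a Ray instance that is already running, setting the variable here probably has no effect on that raylet; in that case it would need to be exported in the shell where Ray was started.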

addisonklinke · December 21, 2021, 8:23pm

@jjyao This does not solve this issue for me, regardless of whether I set the environment variable via bash (export RAY_worker_register_timeout_seconds=30) or python (os.environ['RAY_worker_register_timeout_seconds'] = '30').

My main script runs fine with tune.run(resources_per_trial={'gpu': 1}), but hits the “Failed to register worker” error when I request more than one GPU per trial.

jjyao · December 21, 2021, 8:54pm

Hi @addisonklinke,

If that’s the case, could you file a GitHub issue with a reproducible script? Thanks!

addisonklinke · December 21, 2021, 10:15pm

@jjyao Certainly, I just filed a reproducible script from one of the official tutorials

[Bug] Failed to register worker to Raylet for single node, multi-GPU #21226
