Ray init fails to register workers

Andrea_Pisoni · May 6, 2022, 6:40am

Hi, I am seeing a strange issue with newer versions of Ray (1.10 and above): if I start Ray from JupyterLab with ray.init(), I get a kernel crash and the error "Failed to register worker xxx to Raylet. Invalid: Invalid: Unknown worker."

However, if I run ray start --head from the shell and connect with ray.init(address='auto'), it works fine.

Any reason why ray.init() might be timing out? Where should I start looking?

jjyao · August 15, 2022, 5:58pm

Sorry for the late reply. Does it still fail with latest Ray (1.13.0)?

Andrea_Pisoni · August 15, 2022, 11:51pm

Hey jjyao!

Yes, but we found out that the problem is simply that Ray by default takes the number of CPUs from nprocs. That won’t work on OpenShift, for example, where your container’s CPU allocation is defined by the cgroup while nprocs reports the CPUs of the physical host.

So Ray thought I had 44 CPU cores while I only had 4. It was spawning way too many workers and everything was dying.

I guess if you wanted to “fix” this you could have Ray look at the cpu.shares cgroup file when running in containers, but honestly, simply setting num_cpus manually solves it.

As a side note, the same problem affects the dashboard: it always shows 44 cores per host no matter what I do, which is of course wrong. I wonder why it shows something different from ray.state, but again, it’s just minor.
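For readers hitting the same crash, the manual workaround described above might look roughly like the sketch below. The POD_CPU_LIMIT environment variable and the helper names are hypothetical (nothing in Ray reads that variable); the only Ray call used is ray.init(num_cpus=...).

```python
import os


def cpus_for_ray(default=1):
    """Return the CPU count to hand to Ray.

    Reads a hypothetical POD_CPU_LIMIT environment variable that the pod
    spec would have to set; falls back to `default` when it is missing,
    non-numeric, or not positive.
    """
    raw = os.environ.get("POD_CPU_LIMIT", "")
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value > 0 else default


def init_ray_with_limit():
    """Start Ray with an explicit CPU count instead of the auto-detected one."""
    import ray  # imported lazily so the helper above works without Ray installed

    return ray.init(num_cpus=cpus_for_ray())
```

When starting the head from the shell instead, ray start --head --num-cpus=4 has the same effect.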

jjyao · August 16, 2022, 11:46pm

Thanks for the response!

We do have logic to auto-detect the number of CPUs correctly inside Docker (ray/utils.py at 693856975ab135cd513c41a72b0455fb3385e14b · ray-project/ray · GitHub), but it seems that it doesn’t work in your case? If so, would you mind filing a GitHub issue with your container setup so we can try to reproduce and fix it?

Andrea_Pisoni · August 17, 2022, 12:19am

Sure, I’ll open a GitHub issue! I’m thinking it might be because we create the cluster by hand, starting workers on the pods and pointing them at the head IP explicitly, rather than with ray up. So maybe Ray doesn’t know I’m on a cluster with containers, as one would usually follow this process with VMs.

The reason we don’t use ray up is that we are on a private cloud (OpenShift) in an enterprise setting and we don’t have access to the OpenShift CLI. We are given an API to spawn pods with resources and execute a command/script.

So my flow looks a bit like this:

head_ip = spawn_pods(1, cpu, mem, gpu, cmd="ray start --head --block")

workers = spawn_pods(n_workers, cpu, mem, gpu, cmd=f"ray start --address {head_ip} --block")

ray.init(address=head_ip)

If the check_docker_cpu code is not triggered when following this flow, having something like a --container flag on ray start to tell Ray I’m on a cluster would do it for my use case.

jjyao · August 17, 2022, 12:25am

ray start should trigger the Docker CPU detection code I mentioned. It would be nice if I could create the same container so I can reproduce it.

Or, if you can help, you can add some logs to the code I mentioned and see where it goes wrong. It’s just a utility function, so you can run it in your terminal without starting Ray.

Andrea_Pisoni · August 17, 2022, 6:38am

@jjyao OK, I have done a little bit of debugging. The problem is that Ray loses the right CPU count when our OpenShift has CPU bursting active. If I start a pod with 1 CPU and CPU bursting is not available, /sys/fs/cgroup/cpu/cpu.cfs_quota_us returns the right amount. However, if CPU bursting is on, it returns -1.

In this case, I think the /sys/fs/cgroup/cpu/cpu.shares contains the “guaranteed” millicores assigned to the pod, however Ray is not looking at that file right now.
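A minimal sketch of the fallback suggested here, not Ray's actual detection code: read the cgroup v1 quota first and, when it is -1, fall back to cpu.shares. It assumes the usual Kubernetes convention that 1024 shares correspond to one requested CPU, which may not hold on every cluster; the function name and the cgroup_dir parameter are made up for illustration.

```python
import os
from pathlib import Path

# cgroup v1 location used in the thread; cgroup v2 hosts lay this out differently.
CGROUP_CPU_DIR = Path("/sys/fs/cgroup/cpu")


def _read_int(path):
    """Return the integer stored in `path`, or None if it is missing or malformed."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def estimate_container_cpus(cgroup_dir=CGROUP_CPU_DIR):
    """Estimate the CPUs available to this container (hypothetical helper)."""
    cgroup_dir = Path(cgroup_dir)

    # 1. Hard limit: quota / period. A quota of -1 means "no limit",
    #    which is what the bursting-enabled pods report.
    quota = _read_int(cgroup_dir / "cpu.cfs_quota_us")
    period = _read_int(cgroup_dir / "cpu.cfs_period_us")
    if quota is not None and period is not None and quota > 0 and period > 0:
        return max(1, quota // period)

    # 2. Guaranteed share: assumes 1024 shares per requested CPU.
    #    An unconfigured cgroup also reports 1024, so this yields 1 there.
    shares = _read_int(cgroup_dir / "cpu.shares")
    if shares is not None and shares > 0:
        return max(1, shares // 1024)

    # 3. Last resort: whatever the host reports.
    return os.cpu_count() or 1
```

The result could then be passed to ray.init(num_cpus=...) or to ray start --num-cpus until detection is handled upstream.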

jjyao · August 17, 2022, 2:16pm

cc @Alex Is this something we should handle (the CPU bursting case)?

Alex · August 17, 2022, 4:43pm

Hmm, I’m not familiar with OpenShift CPU bursting. Can we file an issue on GitHub and discuss supporting it there? Naively, it seems to make sense to have more robust cgroup support.

jjyao · August 17, 2022, 5:28pm

Created [Core] Cannot detect number of cpus inside docker correctly with cpu bursting · Issue #27958 · ray-project/ray · GitHub for tracking.
