Hi, I am seeing a strange issue with the newer versions of ray (1.10 and above) where if I start ray from jupyterlab with ray.init() i get a kernel crash and
Failed to register worker xxx to Raylet. Invalid: Invalid: Unknown worker.
However, if I run ray start --head from the shell and connect with ray.init(address='auto'), it works fine.
Any reason why ray.init() might be timing out? Where should I start looking?
Sorry for the late reply. Does it still fail with latest Ray (1.13.0)?
Yes, but we found out that the problem is simply that Ray by default takes the number of CPUs from nprocs, which won't work, for example, on OpenShift, where your container's CPU allocation is defined by the cgroup while nprocs reports the CPUs of the physical host.
So Ray thought I had 44 CPU cores while I really had 4; it was spawning way too many workers and everything was dying.
I guess if you wanted to "fix" this you could make Ray look at the cpu.shares cgroup file when running in a container, but honestly simply setting num_cpus manually solves it.
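For anyone hitting the same thing, here is a minimal sketch of both the manual workaround and the cgroup v1 quota arithmetic involved (the file names are the standard cgroup v1 ones; `ray.init(num_cpus=...)` is the documented override):

```python
# Workaround: tell Ray the container's real CPU budget instead of letting it
# fall back to the host-wide core count.
#
#   import ray
#   ray.init(num_cpus=4)  # the cgroup limit of the pod, not the 44 host cores

# Sketch of the cgroup v1 CFS-quota arithmetic that container CPU detection
# relies on: quota / period = number of CPUs the container may actually use.
def cpus_from_cfs_quota(quota_us: int, period_us: int):
    """Return the CPU limit implied by a CFS quota, or None when no quota is set.

    cpu.cfs_quota_us is -1 when the container has no hard CPU cap.
    """
    if quota_us <= 0 or period_us <= 0:
        return None
    return quota_us / period_us

# A pod limited to 4 CPUs typically has quota=400000 and period=100000.
```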
As a side note, the same problem affects the dashboard: it always shows 44 cores per host no matter what I do, which is of course wrong. I wonder why it shows something different from ray.state, but again, it's minor.
Thanks for the response!
We do have logic to auto-detect the number of CPUs correctly inside Docker (ray/utils.py at 693856975ab135cd513c41a72b0455fb3385e14b · ray-project/ray · GitHub), but it seems it doesn't work in your case? If so, do you mind filing a GitHub issue with your container setup so we can try to reproduce and fix it?
Sure, I'll open a GitHub issue! I'm thinking it might be because we create the cluster by starting workers on the pods explicitly, pointing them at the head IP, rather than with ray up. So maybe Ray doesn't know I'm on a cluster with containers, since usually one would follow this process with VMs.
The reason we don't use ray up is that we are on a private cloud (OpenShift) in an enterprise setting and don't have access to the OpenShift CLI. We are given an API to spawn pods with resources and execute a command/script.
So my flow looks a bit like this:
head_ip = spawn_pods(1, cpu, mem, gpu, cmd="ray start --head --block")
workers = spawn_pods(n_workers, cpu, mem, gpu, cmd="ray start --address head_ip --block")
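The flow above can be sketched as follows (`spawn_pods` is our internal pod API, not part of Ray; appending the default Ray port 6379 to the address is my assumption about how the workers join):

```python
# Hypothetical sketch of our cluster bring-up; spawn_pods is an internal
# OpenShift-facing API, not a Ray call.
def head_cmd() -> str:
    # Start the head node; --block keeps the process in the foreground
    # so the pod stays alive as long as Ray does.
    return "ray start --head --block"

def worker_cmd(head_ip: str, port: int = 6379) -> str:
    # Workers join the cluster by pointing --address at the head pod.
    # Port 6379 is Ray's default; adjust if the head uses --port.
    return f"ray start --address {head_ip}:{port} --block"

# head_ip = spawn_pods(1, cpu, mem, gpu, cmd=head_cmd())
# workers = spawn_pods(n_workers, cpu, mem, gpu, cmd=worker_cmd(head_ip))
```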
If the check_docker_cpu code is not triggered by this flow, then something like a --container flag on ray start to tell Ray it is running in a container would cover my use case.
ray start should trigger the Docker CPU detection code I mentioned. It would be nice if I could create the same container so I can reproduce it.
Or, if you can help, you can add some logs to the code I mentioned and see where it goes wrong. It's just a utility function, so you can run it in your terminal without starting Ray.
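In case it helps others debug the same thing, a standalone probe along these lines prints the cgroup v1 files that container CPU detection typically consults, with no Ray needed (the paths assume a cgroup v1 layout; cgroup v2 uses cpu.max instead):

```python
import os

# Cgroup v1 files commonly consulted for container CPU limits
# (assumption: cgroup v1 layout, as on our OpenShift nodes).
CGROUP_FILES = [
    "/sys/fs/cgroup/cpu/cpu.cfs_quota_us",   # -1 => no hard quota set
    "/sys/fs/cgroup/cpu/cpu.cfs_period_us",  # quota / period = CPU limit
    "/sys/fs/cgroup/cpu/cpu.shares",         # relative CPU weight (1024 per CPU)
]

def read_cgroup_value(path: str):
    """Return the integer stored in a cgroup file, or None if it doesn't exist."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return int(f.read().strip())

if __name__ == "__main__":
    for path in CGROUP_FILES:
        print(path, "=>", read_cgroup_value(path))
    # This is the host-wide count Ray falls back to when no limit is detected.
    print("os.cpu_count() =>", os.cpu_count())
```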
@jjyao OK, I have done a bit of debugging. The problem is that Ray loses the correct CPU count when our OpenShift cluster has CPU bursting active. If I start a pod with 1 CPU and CPU bursting is not available, then
/sys/fs/cgroup/cpu/cpu.cfs_quota_us returns the right amount. However, if I have CPU bursting on, it returns -1 (i.e., no quota set).
In this case, I think
/sys/fs/cgroup/cpu/cpu.shares contains the "guaranteed" millicores assigned to the pod, but Ray is not looking at that file right now.
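If Ray did want a fallback here, the arithmetic could look something like this sketch (1024 shares per CPU is the standard cgroup v1 convention: e.g. a pod requesting 500m gets cpu.shares = 512):

```python
def cpus_from_shares(shares: int) -> float:
    """Estimate guaranteed CPUs from cgroup v1 cpu.shares.

    Cgroup v1 convention: 1 CPU == 1024 shares, so a pod with a
    Kubernetes request of 500m ends up with cpu.shares = 512.
    This is a sketch of a possible fallback, not Ray's actual logic.
    """
    return shares / 1024.0

# e.g. a pod with a 4-CPU request: cpus_from_shares(4096) -> 4.0
```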
cc @Alex Is this something we should handle (the CPU bursting case)?
Hmm, I'm not familiar with OpenShift CPU bursting. Can we file an issue on GitHub and discuss supporting it there? Naively, it seems to make sense to have more robust cgroup support.