When the head pod is deleted for any reason, or the node it lives on is drained, the head pod is never recreated.
My current workaround is to slightly modify the spec of rayHeadType to force the Ray cluster to restart (Ray Operator Advanced Configuration — Ray 1.12.1). For instance, I change the CPU of rayHeadType from 4 to 3 and then back to 4 again.
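For reference, the tweak looks roughly like this. It's just a sketch assuming the podTypes layout of the example RayCluster manifest from the Ray 1.12 Kubernetes docs; the cluster name, container name, and the 4 → 3 CPU values are placeholders specific to my setup.

```yaml
# Sketch of the workaround: nudge the head pod type's CPU so the operator
# sees a spec change and recreates the head pod, then revert the value.
# "example-cluster" and "ray-node" are placeholder names from the example
# manifest; substitute whatever your manifest uses.
apiVersion: cluster.ray.io/v1
kind: RayCluster
metadata:
  name: example-cluster
spec:
  headPodType: rayHeadType
  podTypes:
    - name: rayHeadType
      podConfig:
        apiVersion: v1
        kind: Pod
        spec:
          containers:
            - name: ray-node
              resources:
                requests:
                  cpu: 3      # temporarily change 4 -> 3, apply, then revert to 4
                limits:
                  cpu: 3
    # ... worker pod types unchanged ...
```

I re-apply the manifest with kubectl apply after each edit, and the operator picks up the spec change and recreates the head pod.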
Thanks for working on KubeRay. I did check out the repo, but I couldn't figure out how it differs from the current process for deploying on K8s (Deploying on Kubernetes — Ray 1.13.0). Hopefully the detailed documentation will point that out.
We will eventually host Helm charts for KubeRay at a public Helm repository.
The primary difference between KubeRay and the operator currently documented in the Ray repo is, frankly, that KubeRay is more stable – e.g. you should not have issues with a head pod that fails to restart.
Returning to the original question, though: head restarts with the older operator are tested, although I've observed that the restart can take a minute or so.
If you could post the operator logs (kubectl logs ray-operator) from when the head fails to restart, that would be helpful.