Head pod does not restart after deleting/draining

Hi there,

I am deploying Ray on K8s using the Helm chart. My environment and values.yaml are shown below:

  • ray == 1.12.0
  • running Ray on K8s (GKE)
    • 1.21.10-gke.2000

# `values.yaml`
image: rayproject/ray:1.12.0-py38
podTypes:
    rayHeadType:
        CPU: 4
        memory: 30Gi
        GPU: 0
        rayResources: { "CPU": 0 }
    rayWorkerType:
        minWorkers: 0
        maxWorkers: 6
        memory: 30Gi
        CPU: 3
        GPU: 0

Whenever the head pod is deleted, or the node it runs on is drained, the head pod is never recreated, whatever the cause.

For now, my workaround is to make a small change to the spec of rayHeadType, which forces the Ray cluster to restart (see Ray Operator Advanced Configuration — Ray 1.12.1). For instance, I changed the CPU of rayHeadType to 3 and then changed it back to 4.
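
For illustration, the workaround above could be scripted roughly as follows. This is only a sketch under several assumptions: that the spec is changed through `helm upgrade`, that the head CPU lives at the key path `podTypes.rayHeadType.CPU` (the chart's default layout), and that the release name, chart path, and namespace shown are placeholders rather than values from my setup. It defaults to a dry run that only prints the commands.

```python
import subprocess
from typing import List


def helm_upgrade_cmd(release: str, chart: str, namespace: str,
                     values_file: str, head_cpu: int) -> List[str]:
    """Build a `helm upgrade` command that overrides only the head pod's CPU."""
    return [
        "helm", "upgrade", release, chart,
        "--namespace", namespace,
        "-f", values_file,
        # Assumed key path, based on the chart's default values.yaml layout.
        "--set", f"podTypes.rayHeadType.CPU={head_cpu}",
    ]


def bounce_head_cmds(release: str, chart: str, namespace: str, values_file: str,
                     original_cpu: int, temporary_cpu: int) -> List[List[str]]:
    """Return two upgrades: change the head CPU, then restore the original."""
    if temporary_cpu == original_cpu:
        raise ValueError("temporary_cpu must differ from original_cpu")
    return [
        helm_upgrade_cmd(release, chart, namespace, values_file, temporary_cpu),
        helm_upgrade_cmd(release, chart, namespace, values_file, original_cpu),
    ]


def run(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical release name, chart path, and namespace.
    run(bounce_head_cmds("example-cluster", "./ray", "ray", "values.yaml",
                         original_cpu=4, temporary_cpu=3))
```

In practice it is probably worth pausing between the two upgrades until the operator has picked up the first change.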

Any tips for this situation?

Hi there, any suggestions?

cc @Dmitri who knows a lot about k8s and Ray

I’d recommend checking out KubeRay (GitHub - ray-project/kuberay: A toolkit to run Ray applications on Kubernetes) to deploy Ray on Kubernetes.
Detailed KubeRay documentation and guides are in the works.

Thanks for working on KubeRay. I did check out the repo, but I couldn’t figure out how it differs from the current process for deploying on K8s (Deploying on Kubernetes — Ray 1.13.0). I hope the detailed documentation will explain that.

Both Installing the Ray Operator with Helm and KubeRay use Helm for management. Will there be an official Helm repo, so that we don’t have to clone the whole GitHub repo?

We will eventually host Helm charts for KubeRay at a public Helm repository.

The primary difference between KubeRay and the operator currently documented in the Ray repo is, frankly, that KubeRay is more stable – e.g. you should not have issues with a head pod that fails to restart.

Returning to the original question: head restarts with the older operator are tested, although I’ve observed that a restart can take a minute or so.
If you could post the operator logs (kubectl logs ray-operator) from when the head fails to restart, that would be helpful.
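
As a rough sketch of how those logs might be collected, the snippet below builds the relevant `kubectl` commands and can optionally save the output to a file. The target `deployment/ray-operator`, the namespace `default`, and the 30-minute window are assumptions for illustration; substitute whatever names your install actually uses (for example, the exact operator pod name).

```python
import subprocess
from typing import List


def operator_logs_cmd(namespace: str,
                      target: str = "deployment/ray-operator",
                      since: str = "30m") -> List[str]:
    """Build a `kubectl logs` command for the operator (target name is assumed)."""
    return [
        "kubectl", "logs", target,
        "--namespace", namespace,
        f"--since={since}",   # only recent log lines
        "--timestamps",       # makes it easier to line up with the pod deletion
    ]


def list_pods_cmd(namespace: str) -> List[str]:
    """Build a command that lists pods, to confirm whether the head pod exists."""
    return ["kubectl", "get", "pods", "--namespace", namespace, "-o", "wide"]


def save_output(cmd: List[str], path: str) -> None:
    """Run a command and write its stdout to a file for attaching to a post."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(result.stdout)


if __name__ == "__main__":
    # Hypothetical namespace; printing only, nothing is executed here.
    print(" ".join(list_pods_cmd("default")))
    print(" ".join(operator_logs_cmd("default")))
```

Calling `save_output(operator_logs_cmd("default"), "operator.log")` shortly after deleting the head pod would capture the window in which the restart should have happened.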