Head pod does not restart after deleting/draining

Andrew_Li · May 19, 2022, 9:18am

Hi there,

I am deploying Ray on K8s using the Helm chart. The environment and the values.yaml are shown below:

ray == 1.12.0
running Ray on K8s (GKE)
- 1.21.10-gke.2000

# `values.yaml`
image: rayproject/ray:1.12.0-py38
podTypes:
  rayHeadType:
    CPU: 4
    memory: 30Gi
    GPU: 0
    rayResources: { "CPU": 0 }
  rayWorkerType:
    minWorkers: 0
    maxWorkers: 6
    memory: 30Gi
    CPU: 3
    GPU: 0

Whenever the head pod is deleted, or the node hosting it is drained, for whatever reason, the head pod is never recreated.

My current workaround is to modify the rayHeadType spec slightly, which forces the Ray cluster to restart (see "Ray Operator Advanced Configuration" in the Ray 1.12.1 docs). For instance, I change the CPU of rayHeadType to 3 and then change it back to 4.

Any tips for this situation?
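
To confirm the symptom described above (no head pod after a delete or drain), one option is to inspect the output of `kubectl get pods -o json`. The following is a minimal sketch, not part of the original post. It assumes the operator labels the head pod with `ray-node-type: head`; that label, the function name `find_head_pods`, and the sample pod data are assumptions for illustration, so check the labels on your own pods first.

```python
import json
from typing import Any, Dict, List, Tuple

# Assumption: the head pod carries the label ray-node-type=head.
# Verify with `kubectl get pods --show-labels` before relying on this.
HEAD_LABEL_KEY = "ray-node-type"
HEAD_LABEL_VALUE = "head"


def find_head_pods(
    pod_list: Dict[str, Any],
    label_key: str = HEAD_LABEL_KEY,
    label_value: str = HEAD_LABEL_VALUE,
) -> List[Tuple[str, str]]:
    """Return (name, phase) for every pod that carries the head label."""
    heads = []
    for pod in pod_list.get("items", []):
        metadata = pod.get("metadata", {})
        labels = metadata.get("labels") or {}
        if labels.get(label_key) == label_value:
            phase = pod.get("status", {}).get("phase", "Unknown")
            heads.append((metadata.get("name", ""), phase))
    return heads


# Hypothetical pod list shaped like `kubectl get pods -o json` output,
# containing only a worker pod (the situation described in this thread).
sample = json.loads("""
{
  "items": [
    {
      "metadata": {
        "name": "example-worker",
        "labels": {"ray-node-type": "worker"}
      },
      "status": {"phase": "Running"}
    }
  ]
}
""")

print("head pods:", find_head_pods(sample))  # prints "head pods: []"
```

With real data, the JSON from `kubectl get pods -n <namespace> -o json` can be loaded with `json.loads` and passed to `find_head_pods`. An empty list a few minutes after the drain would match the behaviour reported here.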

Andrew_Li · August 5, 2022, 4:21am

Hi there, any suggestions?

cade · August 9, 2022, 1:23am

cc @Dmitri who knows a lot about k8s and Ray

Dmitri · August 9, 2022, 2:08am

I’d recommend checking out KubeRay (GitHub: ray-project/kuberay, a toolkit to run Ray applications on Kubernetes) to deploy Ray on Kubernetes.
Detailed KubeRay documentation and guides are in the works.

Andrew_Li · August 9, 2022, 4:04am

Thanks for working on KubeRay. I did check out the repo, but I couldn’t figure out how it differs from the current process of deploying on K8s ("Deploying on Kubernetes" in the Ray 1.13.0 docs). I hope the detailed documentation will point that out.

Andrew_Li · August 9, 2022, 4:07am

Both "Installing the Ray Operator with Helm" and KubeRay use Helm for deployment. Will there be an official Helm repo, so that we don’t have to clone the whole GitHub repo?

Dmitri · August 9, 2022, 4:51am

We will eventually host Helm charts for KubeRay at a public Helm repository.

The primary difference between KubeRay and the operator currently documented in the Ray repo is, frankly, that KubeRay is more stable – e.g. you should not have issues with a head pod that fails to restart.

Dmitri · August 9, 2022, 5:48am

Returning to the original question: head restarts with the older operator are tested, though I’ve observed that the restart might take a minute or so.
If you could post operator logs (kubectl logs ray-operator) from when the head fails to restart, that would be helpful.
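
For anyone gathering those logs, here is a minimal sketch of a helper that builds the `kubectl logs` command and saves the output to a file. The function names, the output path, and the example pod name are hypothetical; the actual operator pod name (usually with a generated suffix) should be taken from `kubectl get pods`.

```python
import subprocess
from typing import List, Optional


def build_logs_command(
    pod_name: str,
    namespace: Optional[str] = None,
    since: Optional[str] = None,
) -> List[str]:
    """Build a `kubectl logs` command for the given pod."""
    # --timestamps prefixes each line with its time, which helps line the
    # logs up with the moment the head pod was deleted or drained.
    cmd = ["kubectl", "logs", pod_name, "--timestamps"]
    if namespace:
        cmd += ["--namespace", namespace]
    if since:
        # Only return logs newer than this duration, e.g. "15m" or "1h".
        cmd.append(f"--since={since}")
    return cmd


def save_operator_logs(
    pod_name: str,
    out_path: str,
    namespace: Optional[str] = None,
    since: Optional[str] = None,
) -> None:
    """Run `kubectl logs` and write its output to out_path."""
    cmd = build_logs_command(pod_name, namespace=namespace, since=since)
    # check=True raises CalledProcessError if kubectl exits non-zero.
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(result.stdout)


# Example (pod name is a placeholder; requires kubectl and cluster access):
# save_operator_logs("ray-operator", "operator.log", since="15m")
```

Limiting the window with `since` keeps the attached log short enough to paste into a reply while still covering the failed restart.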
