Hi there,
I am deploying Ray on K8s using the Helm chart. My environment and `values.yaml`
are shown below:
- ray == 1.12.0
- running Ray on K8s (GKE)
- GKE version 1.21.10-gke.2000
# `values.yaml`
image: rayproject/ray:1.12.0-py38
podTypes:
  rayHeadType:
    CPU: 4
    memory: 30Gi
    GPU: 0
    rayResources: { "CPU": 0 }
  rayWorkerType:
    minWorkers: 0
    maxWorkers: 6
    memory: 30Gi
    CPU: 3
    GPU: 0
Whenever the head pod is deleted, or the node it runs on is drained, for whatever reason, the head pod is never created again.
For now, my workaround is to modify the spec of rayHeadType slightly, which forces the Ray cluster to restart (see "Ray Operator Advanced Configuration" in the Ray 1.12.1 docs). For instance, I changed the CPU of rayHeadType from 4 to 3 and then changed it back to 4.
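To make the workaround concrete, here is a rough sketch of it as a script. It assumes the spec change is applied with `helm upgrade`; the release name, chart path, and namespace are placeholders rather than values from my setup, and `DRY_RUN` is a hypothetical switch so the commands are only printed by default.

```shell
#!/usr/bin/env bash
# Sketch of the workaround: change the head pod's CPU, then restore it.
# RELEASE, CHART and NAMESPACE are placeholders (assumptions), not real values.
RELEASE="${RELEASE:-example-cluster}"
CHART="${CHART:-./ray}"
NAMESPACE="${NAMESPACE:-ray}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the helm commands, do not run them

# Build one `helm upgrade` that overrides podTypes.rayHeadType.CPU,
# then either print it (dry run) or execute it.
set_head_cpu() {
  local cpu="$1"
  local cmd=(helm upgrade "$RELEASE" "$CHART" -n "$NAMESPACE"
             -f values.yaml --set "podTypes.rayHeadType.CPU=$cpu")
  if [ "$DRY_RUN" = "1" ]; then
    echo "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}

set_head_cpu 3   # temporary value, so the head pod spec differs
set_head_cpu 4   # back to the original value
```

Running it with `DRY_RUN=0` would actually apply both upgrades. In practice it may be necessary to wait for the cluster to restart after the first upgrade before applying the second.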
Any tips for this situation?