What happened + What you expected to happen
We deployed a Ray cluster (v2.34.0) on Kubernetes using KubeRay, with external Redis enabled for GCS fault tolerance.
During reliability testing, we injected process failures into the HeadPod 219 times (`kill -9` on the raylet, gcs_server, and other Ray processes; sketched below). We observed that, with some probability, WorkerPods also restarted unexpectedly after a HeadPod process failure.
We expected only the HeadPod to be affected, with WorkerPods remaining stable.
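For reference, each injection round looked roughly like the sketch below. The `ray.io/node-type=head` label is the standard KubeRay pod label, but the container name `ray-head`, the use of `pgrep`, and the choice of `gcs_server` as the target are illustrative assumptions, not the exact commands we ran:

```bash
# Sketch of a single fault injection into the head pod (illustrative).
# Assumes the head container is named "ray-head" and that pgrep exists in the image.
HEAD_POD=$(kubectl get pods -l ray.io/node-type=head -o jsonpath='{.items[0].metadata.name}')

# SIGKILL one of the head processes (gcs_server here; raylet etc. in other rounds).
kubectl exec "$HEAD_POD" -c ray-head -- bash -c 'kill -9 $(pgrep -f gcs_server)'
```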
Description
- Bug: Injecting process failures into the HeadPod may trigger restarts of WorkerPods as well.
- Expected Behavior: Only the HeadPod should be affected by injected failures; WorkerPods should continue to operate normally.
- Additional Observations:
  - Lowering the value of the environment variable `RAY_gcs_rpc_server_reconnect_timeout_s` on the HeadPod reduces the likelihood of WorkerPod restarts, but does not completely eliminate them.
  - Pod status captured during testing (`kubectl get pods`); note the 219 restarts of the HeadPod and the nonzero restart counts on several WorkerPods:
| NAME | READY | STATUS | RESTARTS | AGE |
|-------------------------------------------|--------|------------------|-------------------|-------|
| kuberay-worker-jobexecutorgroup-hdnf2 | 1/1 | Running | 9 (3h47m ago) | — |
| kuberay-worker-jobexecutorgroup-vvhza | 1/1 | Running | 3 (35m ago) | — |
| kuberay-worker-jobexecutorgroup-k7vwm | 1/1 | Running | 4 (114m ago) | — |
| kuberay-worker-jobexecutorgroup-mrwmd | 1/1 | Running | 2 (35m ago) | — |
| kuberay-head-g4b42 | 1/2 | CrashLoopBackOff | 219 (20s ago) | — |
| kuberay-worker-ctrlgroup-sh8r6 | 1/1 | Running | 7 (58m ago) | — |
| kuberay-worker-ctrlgroup-cqn54 | 0/1 | Pending | 0 | — |
| kuberay-worker-ctrlgroup-g48st | 1/1 | Running | 8 (58m ago) | — |
| kuberay-worker-ctrlgroup-z29vg | 1/1 | Running | 6 (3h54m ago) | — |
| kuberay-worker-frontgroup-bmdnp | 1/1 | Running | 4 (128m ago) | — |
| kuberay-worker-frontgroup-2qlf7 | 1/1 | Running | 5 (159m ago) | — |
| kuberay-worker-frontgroup-f7pxs | 1/1 | Running | 7 (7h54m ago) | — |
| kuberay-worker-frontgroup-htbt5 | 1/1 | Running | 4 (107m ago) | — |
| kuberay-worker-frontgroup-q4svn | 1/1 | Running | 6 (8h ago) | — |
Versions / Dependencies
- Ray version: 2.34.0
- KubeRay version: 1.1.0
- Python version: 3.10.2
- OS: Euler OS 2.0 (aarch64)
- Kubernetes version (server): v1.30.6-r10-30.0.34.4-arm64
Reproduction script
values.yaml
global:
  annotations:
    ray.io/external-storage-namespace: my-raycluster-storage
    ray.io/ft-enabled: 'true'
    ray.io/overwrite-container-cmd: 'true'
head:
  containerEnv:
    - name: RAY_REDIS_ADDRESS
      value: xxx:6379
    - name: RAY_gcs_rpc_server_reconnect_timeout_s
      value: "10"
  livenessProbe:
    exec:
      command:
        - bash
        - '-c'
        - wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && wget -T 2 -q -O- http://$POD_IP:8265/api/gcs_healthz | grep success
    failureThreshold: 3
    initialDelaySeconds: 30
    periodSeconds: 5
    successThreshold: 1
    timeoutSeconds: 1
  readinessProbe:
    exec:
      command:
        - bash
        - '-c'
        - wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && wget -T 2 -q -O- http://$POD_IP:8265/api/gcs_healthz | grep success
    failureThreshold: 3
    initialDelaySeconds: 10
    periodSeconds: 5
    successThreshold: 1
    timeoutSeconds: 1
workerGroups:
  xxxGroup:
    livenessProbe:
      exec:
        command:
          - bash
          - '-c'
          - wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success
      failureThreshold: 3
      initialDelaySeconds: 30
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 1
    readinessProbe:
      exec:
        command:
          - bash
          - '-c'
          - wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success
      failureThreshold: 3
      initialDelaySeconds: 10
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 1
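For context, the cluster is installed from this values.yaml via Helm and the failures are injected in a loop while watching restart counts. The sketch below is illustrative only: the chart reference, release name, namespace, and 60 s pacing are placeholders, not our exact commands.

```bash
# Illustrative deployment + test loop; chart, release, and namespace are placeholders.
helm upgrade --install kuberay <our-raycluster-chart> -n ray -f values.yaml

# Terminal 1: watch READY/RESTARTS of the Ray pods.
kubectl get pods -n ray -w

# Terminal 2: repeatedly inject head-process failures (see the kill sketch above).
for i in $(seq 1 219); do
  HEAD_POD=$(kubectl get pods -n ray -l ray.io/node-type=head -o jsonpath='{.items[0].metadata.name}')
  kubectl exec -n ray "$HEAD_POD" -c ray-head -- bash -c 'kill -9 $(pgrep -f gcs_server)'
  sleep 60
done
```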
Issue Severity
This issue is blocking my work.