What happened + What you expected to happen
We deployed a Ray cluster (v2.34.0) on Kubernetes using KubeRay, with external Redis enabled for GCS fault tolerance.
During reliability testing, we injected process failures into the HeadPod 219 times (`kill -9` on the raylet, gcs_server, and other Ray processes; sketched below). We observed that, with some probability, WorkerPods also restarted unexpectedly after a HeadPod process failure.
We expected only the HeadPod to be affected, with WorkerPods remaining stable.
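For reference, each injection round looked roughly like the sketch below. The `ray.io/node-type=head` label is the standard KubeRay pod label, but the container name `ray-head`, the use of `pgrep`, and the choice of `gcs_server` as the target are illustrative assumptions, not the exact commands we ran:

```bash
# Sketch of a single fault injection into the head pod (illustrative).
# Assumes the head container is named "ray-head" and that pgrep exists in the image.
HEAD_POD=$(kubectl get pods -l ray.io/node-type=head -o jsonpath='{.items[0].metadata.name}')

# SIGKILL one of the head processes (gcs_server here; raylet etc. in other rounds).
kubectl exec "$HEAD_POD" -c ray-head -- bash -c 'kill -9 $(pgrep -f gcs_server)'
```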
Description
- Bug: Injecting process failures into the HeadPod may trigger restarts of WorkerPods as well.
- Expected Behavior: Only the HeadPod should be affected by injected failures; WorkerPods should continue to operate normally.
- Additional Observations:
  - Lowering the value of the environment variable `RAY_gcs_rpc_server_reconnect_timeout_s` on the HeadPod reduces the likelihood of WorkerPod restarts, but does not completely eliminate them.
  - Pod status captured during testing (`kubectl get pods`); note the 219 restarts of the HeadPod and the nonzero restart counts on several WorkerPods:
| NAME | READY | STATUS | RESTARTS | AGE |
|-------------------------------------------|--------|------------------|-------------------|-------|
| kuberay-worker-jobexecutorgroup-hdnf2 | 1/1 | Running | 9 (3h47m ago) | — |
| kuberay-worker-jobexecutorgroup-vvhza | 1/1 | Running | 3 (35m ago) | — |
| kuberay-worker-jobexecutorgroup-k7vwm | 1/1 | Running | 4 (114m ago) | — |
| kuberay-worker-jobexecutorgroup-mrwmd | 1/1 | Running | 2 (35m ago) | — |
| kuberay-head-g4b42 | 1/2 | CrashLoopBackOff | 219 (20s ago) | — |
| kuberay-worker-ctrlgroup-sh8r6 | 1/1 | Running | 7 (58m ago) | — |
| kuberay-worker-ctrlgroup-cqn54 | 0/1 | Pending | 0 | — |
| kuberay-worker-ctrlgroup-g48st | 1/1 | Running | 8 (58m ago) | — |
| kuberay-worker-ctrlgroup-z29vg | 1/1 | Running | 6 (3h54m ago) | — |
| kuberay-worker-frontgroup-bmdnp | 1/1 | Running | 4 (128m ago) | — |
| kuberay-worker-frontgroup-2qlf7 | 1/1 | Running | 5 (159m ago) | — |
| kuberay-worker-frontgroup-f7pxs | 1/1 | Running | 7 (7h54m ago) | — |
| kuberay-worker-frontgroup-htbt5 | 1/1 | Running | 4 (107m ago) | — |
| kuberay-worker-frontgroup-q4svn | 1/1 | Running | 6 (8h ago) | — |
Versions / Dependencies
- Ray version: 2.34.0
- KubeRay version: 1.1.0
- Python version: 3.10.2
- OS: Euler OS 2.0 (aarch64)
- Kubernetes version (server): v1.30.6-r10-30.0.34.4-arm64
Reproduction script
values.yaml
global:
  annotations:
    ray.io/external-storage-namespace: my-raycluster-storage
    ray.io/ft-enabled: 'true'
    ray.io/overwrite-container-cmd: 'true'
head:
  containerEnv:
    - name: RAY_REDIS_ADDRESS
      value: xxx:6379
    - name: RAY_gcs_rpc_server_reconnect_timeout_s
      value: "10"
  livenessProbe:
    exec:
      command:
        - bash
        - '-c'
        - wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && wget -T 2 -q -O- http://$POD_IP:8265/api/gcs_healthz | grep success
    failureThreshold: 3
    initialDelaySeconds: 30
    periodSeconds: 5
    successThreshold: 1
    timeoutSeconds: 1
  readinessProbe:
    exec:
      command:
        - bash
        - '-c'
        - wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && wget -T 2 -q -O- http://$POD_IP:8265/api/gcs_healthz | grep success
    failureThreshold: 3
    initialDelaySeconds: 10
    periodSeconds: 5
    successThreshold: 1
    timeoutSeconds: 1
workerGroups:
  xxxGroup:
    livenessProbe:
      exec:
        command:
          - bash
          - '-c'
          - wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success
      failureThreshold: 3
      initialDelaySeconds: 30
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 1
    readinessProbe:
      exec:
        command:
          - bash
          - '-c'
          - wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success
      failureThreshold: 3
      initialDelaySeconds: 10
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 1
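For context, the cluster is installed from this values.yaml via Helm and the failures are injected in a loop while watching restart counts. The sketch below is illustrative only: the chart reference, release name, namespace, and 60 s pacing are placeholders, not our exact commands.

```bash
# Illustrative deployment + test loop; chart, release, and namespace are placeholders.
helm upgrade --install kuberay <our-raycluster-chart> -n ray -f values.yaml

# Terminal 1: watch READY/RESTARTS of the Ray pods.
kubectl get pods -n ray -w

# Terminal 2: repeatedly inject head-process failures (see the kill sketch above).
for i in $(seq 1 219); do
  HEAD_POD=$(kubectl get pods -n ray -l ray.io/node-type=head -o jsonpath='{.items[0].metadata.name}')
  kubectl exec -n ray "$HEAD_POD" -c ray-head -- bash -c 'kill -9 $(pgrep -f gcs_server)'
  sleep 60
done
```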
Issue Severity
This issue is blocking my work.