WorkerPods Unexpectedly Restart When Injecting Failures into Ray HeadPod with GCS FT Enabled on K8S

What happened + What you expected to happen

We deployed a Ray cluster (v2.34.0) using KubeRay on K8S, with external Redis enabled to support GCS fault tolerance.
During reliability testing, we injected process failures into the HeadPod 219 times (kill -9 [raylet pid/gcs_server pid/etc.]). We observed that, with some probability, WorkerPods also restarted unexpectedly following a HeadPod process failure.
We expected only the HeadPod to be affected, with WorkerPods remaining stable.
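
For anyone who wants to reproduce the injection step, the following is a minimal sketch, not the exact tooling used in our tests. It assumes kubectl access to the cluster and that pkill is available in the head container; the helper names and the process list are illustrative. Because the HeadPod runs two containers, a `-c <container>` argument may also be needed.

```python
import random
import subprocess

# Assumed values: substitute the real head Pod name and target processes.
HEAD_POD = "kuberay-head-g4b42"
TARGET_PROCESSES = ["raylet", "gcs_server"]


def build_kill_command(pod: str, process_name: str) -> list[str]:
    """Build a kubectl command that sends SIGKILL to a process by exact name."""
    return ["kubectl", "exec", pod, "--", "pkill", "-9", "-x", process_name]


def inject_failure(pod: str = HEAD_POD, dry_run: bool = True) -> list[str]:
    """Pick one target process at random and (optionally) kill it in the Pod."""
    command = build_kill_command(pod, random.choice(TARGET_PROCESSES))
    if not dry_run:
        # check=False: pkill exits non-zero when no process matched,
        # which is expected while the container is still restarting.
        subprocess.run(command, check=False)
    return command
```

Calling inject_failure(dry_run=False) in a loop with a pause between iterations approximates the repeated injections described above.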


Description

  1. Bug: Injecting process failures into the HeadPod may trigger restarts of WorkerPods as well.
  2. Expected Behavior: Only the HeadPod should be affected by injected failures; WorkerPods should continue to operate normally.
  3. Additional Observations:
    • Lowering the value of the environment variable RAY_gcs_rpc_server_reconnect_timeout_s on the HeadPod reduces the likelihood of WorkerPod restarts, but does not completely eliminate them.

Pod status during the test (kubectl get pods):

| NAME                                      | READY  | STATUS           | RESTARTS          | AGE   |
|-------------------------------------------|--------|------------------|-------------------|-------|
| kuberay-worker-jobexecutorgroup-hdnf2     | 1/1    | Running          | 9 (3h47m ago)     | —     |
| kuberay-worker-jobexecutorgroup-vvhza     | 1/1    | Running          | 3 (35m ago)       | —     |
| kuberay-worker-jobexecutorgroup-k7vwm     | 1/1    | Running          | 4 (114m ago)      | —     |
| kuberay-worker-jobexecutorgroup-mrwmd     | 1/1    | Running          | 2 (35m ago)       | —     |
| kuberay-head-g4b42                        | 1/2    | CrashLoopBackOff | 219 (20s ago)     | —     |
| kuberay-worker-ctrlgroup-sh8r6            | 1/1    | Running          | 7 (58m ago)       | —     |
| kuberay-worker-ctrlgroup-cqn54            | 0/1    | Pending          | 0                 | —     |
| kuberay-worker-ctrlgroup-g48st            | 1/1    | Running          | 8 (58m ago)       | —     |
| kuberay-worker-ctrlgroup-z29vg            | 1/1    | Running          | 6 (3h54m ago)     | —     |
| kuberay-worker-frontgroup-bmdnp           | 1/1    | Running          | 4 (128m ago)      | —     |
| kuberay-worker-frontgroup-2qlf7           | 1/1    | Running          | 5 (159m ago)      | —     |
| kuberay-worker-frontgroup-f7pxs           | 1/1    | Running          | 7 (7h54m ago)     | —     |
| kuberay-worker-frontgroup-htbt5           | 1/1    | Running          | 4 (107m ago)      | —     |
| kuberay-worker-frontgroup-q4svn           | 1/1    | Running          | 6 (8h ago)        | —     |
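
To put a number on "with some probability", the restart counts in the table above can be totalled per group. This is a small sketch with the counts copied by hand from the table; the grouping rule (strip the "kuberay-" prefix and the random suffix) is an assumption about the naming scheme.

```python
from collections import defaultdict

# (pod name, restart count) pairs copied from the table above.
PODS = [
    ("kuberay-worker-jobexecutorgroup-hdnf2", 9),
    ("kuberay-worker-jobexecutorgroup-vvhza", 3),
    ("kuberay-worker-jobexecutorgroup-k7vwm", 4),
    ("kuberay-worker-jobexecutorgroup-mrwmd", 2),
    ("kuberay-head-g4b42", 219),
    ("kuberay-worker-ctrlgroup-sh8r6", 7),
    ("kuberay-worker-ctrlgroup-cqn54", 0),
    ("kuberay-worker-ctrlgroup-g48st", 8),
    ("kuberay-worker-ctrlgroup-z29vg", 6),
    ("kuberay-worker-frontgroup-bmdnp", 4),
    ("kuberay-worker-frontgroup-2qlf7", 5),
    ("kuberay-worker-frontgroup-f7pxs", 7),
    ("kuberay-worker-frontgroup-htbt5", 4),
    ("kuberay-worker-frontgroup-q4svn", 6),
]


def restarts_by_group(pods):
    """Sum restart counts per group, e.g. 'worker-ctrlgroup' or 'head'."""
    totals = defaultdict(int)
    for name, restarts in pods:
        # Drop the 'kuberay-' prefix and the trailing random suffix.
        group = name.removeprefix("kuberay-").rsplit("-", 1)[0]
        totals[group] += restarts
    return dict(totals)


totals = restarts_by_group(PODS)
worker_total = sum(v for k, v in totals.items() if k.startswith("worker-"))
print(totals, worker_total)
```

For the table above this gives 18, 21 and 26 restarts for the jobexecutorgroup, ctrlgroup and frontgroup workers (65 in total) against 219 HeadPod restarts.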

Versions / Dependencies

  • Ray version: 2.34.0
  • KubeRay version: 1.1.0
  • Python version: Python 3.10.2
  • OS: Euler OS 2.0 (aarch64)
  • Kubernetes version: Server Version: v1.30.6-r10-30.0.34.4-arm64

Reproduction script

values.yaml

global:
  annotations:
    ray.io/external-storage-namespace: my-raycluster-storage
    ray.io/ft-enabled: 'true'
    ray.io/overwrite-container-cmd: 'true'
head:
  containerEnv:
    - name: RAY_REDIS_ADDRESS
      value: xxx:6379
    - name: RAY_gcs_rpc_server_reconnect_timeout_s
      value: "10"
  livenessProbe:
    exec:
      command:
        - bash
        - '-c'
        - wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && wget -T 2 -q -O- http://$POD_IP:8265/api/gcs_healthz | grep success
    failureThreshold: 3
    initialDelaySeconds: 30
    periodSeconds: 5
    successThreshold: 1
    timeoutSeconds: 1
  readinessProbe:
    exec:
      command:
        - bash
        - '-c'
        - wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && wget -T 2 -q -O- http://$POD_IP:8265/api/gcs_healthz | grep success
    failureThreshold: 3
    initialDelaySeconds: 10
    periodSeconds: 5
    successThreshold: 1
    timeoutSeconds: 1
workerGroups:
  xxxGroup:
    livenessProbe:
      exec:
        command:
          - bash
          - '-c'
          - wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success
      failureThreshold: 3
      initialDelaySeconds: 30
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 1
    readinessProbe:
      exec:
        command:
          - bash
          - '-c'
          - wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success
      failureThreshold: 3
      initialDelaySeconds: 10
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 1
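
As a rough aid when reading the probe settings above, the time a container can keep failing its liveness probe before the kubelet restarts it is approximately failureThreshold × periodSeconds. The sketch below only does that arithmetic with the values from values.yaml and places the result next to the configured GCS reconnect timeout. It is a back-of-the-envelope estimate, not a diagnosis: whether the worker raylet health endpoint actually fails while the GCS is unreachable is an assumption to verify, and the function name is illustrative.

```python
def liveness_window_seconds(failure_threshold: int, period_seconds: int) -> int:
    """Approximate time a container may fail its liveness probe before restart."""
    return failure_threshold * period_seconds


# Values taken from values.yaml above.
WORKER_LIVENESS_WINDOW = liveness_window_seconds(failure_threshold=3, period_seconds=5)
GCS_RECONNECT_TIMEOUT_S = 10  # RAY_gcs_rpc_server_reconnect_timeout_s on the head

print(f"worker liveness window: ~{WORKER_LIVENESS_WINDOW}s")
print(f"GCS reconnect timeout:  {GCS_RECONNECT_TIMEOUT_S}s")
```

With these settings the liveness window is about 15 seconds, so any condition that keeps local_raylet_healthz failing for longer than that would lead to a container restart.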

Issue Severity

This issue is blocking my work.

Hello! I see you've opened a GitHub issue, so I think we will continue to track this there. Thanks for posting on both forums though!

Link for anyone else who is reading this: [RayCluster] WorkerPods Unexpectedly Restart When Injecting Failures into Ray HeadPod with GCS FT Enabled on K8S · Issue #52480 · ray-project/ray · GitHub