How to set Ray head node in high availability mode using KubeRay Helm chart?

Hongbo-Miao · February 26, 2025, 7:16am

Originally asked on Stack Overflow: How to set Ray head node in high availability mode using KubeRay Helm chart?

Here is a copy:

I am trying to set up high availability (HA) for the Ray head node. Currently, if the Ray head node goes down, any Ray job running in the cluster fails and disappears.

To clarify, I am not using Ray Serve. I am only running some Ray jobs in a Ray cluster.

I deployed my Ray cluster using the KubeRay Helm chart (ray-cluster).

Here is my deployment code:

---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: production-hm-ray-cluster
  namespace: production-hm-argo-cd
  labels:
    app.kubernetes.io/name: hm-ray-cluster
spec:
  project: production-hm
  source:
    repoURL: https://ray-project.github.io/kuberay-helm
    # https://github.com/ray-project/kuberay/releases
    targetRevision: 1.3.0
    chart: ray-cluster
    helm:
      releaseName: hm-ray-cluster
      values: |
        # https://github.com/ray-project/kuberay/blob/master/helm-chart/ray-cluster/values.yaml
        ---
        image:
          tag: 2.41.0-py312-cpu
        head:
          serviceAccountName: hm-ray-cluster-service-account
          autoscalerOptions:
            upscalingMode: Default
            # Seconds
            idleTimeoutSeconds: 300
          resources:
            requests:
              cpu: 1000m
              memory: 8Gi
            limits:
              cpu: 4000m
              memory: 128Gi
        worker:
          replicas: 10
          minReplicas: 10
          maxReplicas: 100
          serviceAccountName: hm-ray-cluster-service-account
          resources:
            requests:
              cpu: 1000m
              memory: 8Gi
            limits:
              cpu: 4000m
              memory: 128Gi
  destination:
    namespace: production-hm-ray-cluster
    server: https://kubernetes.default.svc
  syncPolicy:
    syncOptions:
      - ServerSideApply=true
    automated:
      prune: true

I have read "GCS fault tolerance in KubeRay". I think I need to set gcsFaultToleranceOptions, but I could not find how to set it in the Helm chart.
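For context, here is a minimal sketch of how I understand gcsFaultToleranceOptions to look on a raw RayCluster resource, based on my reading of the GCS fault tolerance docs. This is not Helm values and I have not verified it on my cluster. The metadata name and the image repository are assumptions for illustration; the Redis address is the one from my setup.

```yaml
# Sketch of a raw RayCluster manifest (NOT Helm chart values).
# Assumption: KubeRay 1.3.0 with the ray.io/v1 RayCluster CRD.
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  # Hypothetical name, for illustration only
  name: hm-ray-cluster
spec:
  gcsFaultToleranceOptions:
    # External Redis used to persist GCS state across head restarts
    redisAddress: redis.redis-namespace.svc:6379
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            # Assumed image repository; the tag matches my Helm values
            image: rayproject/ray:2.41.0-py312-cpu
```

What I am missing is the equivalent setting in the Helm chart values.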

Assuming I have a highly available Redis cluster that can be reached at redis.redis-namespace.svc:6379, how do I set the Ray head node in high availability mode using the Helm chart?

I saw a similar question posted about 4 years ago at High Availability for Head node of Ray clusters, but there was no solution at the time.

Any guidance would be appreciated. Thank you!
