How to set Ray head node in high availability mode using KubeRay Helm chart?

Originally asked on Stack Overflow: "How to set Ray head node in high availability mode using KubeRay Helm chart?"

Here is a copy:


I am trying to set up high availability (HA) for the Ray head node. Currently, if the Ray head node goes down, any Ray job running in the Ray cluster fails and disappears.

To clarify, I am not using Ray Serve. I am only running some Ray jobs in a Ray cluster.

I deployed my Ray cluster using the KubeRay Helm chart.

Here is my deployment code:

---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: production-hm-ray-cluster
  namespace: production-hm-argo-cd
  labels:
    app.kubernetes.io/name: hm-ray-cluster
spec:
  project: production-hm
  source:
    repoURL: https://ray-project.github.io/kuberay-helm
    # https://github.com/ray-project/kuberay/releases
    targetRevision: 1.3.0
    chart: ray-cluster
    helm:
      releaseName: hm-ray-cluster
      values: |
        # https://github.com/ray-project/kuberay/blob/master/helm-chart/ray-cluster/values.yaml
        ---
        image:
          tag: 2.41.0-py312-cpu
        head:
          serviceAccountName: hm-ray-cluster-service-account
          autoscalerOptions:
            upscalingMode: Default
            # Seconds
            idleTimeoutSeconds: 300
          resources:
            requests:
              cpu: 1000m
              memory: 8Gi
            limits:
              cpu: 4000m
              memory: 128Gi
        worker:
          replicas: 10
          minReplicas: 10
          maxReplicas: 100
          serviceAccountName: hm-ray-cluster-service-account
          resources:
            requests:
              cpu: 1000m
              memory: 8Gi
            limits:
              cpu: 4000m
              memory: 128Gi
  destination:
    namespace: production-hm-ray-cluster
    server: https://kubernetes.default.svc
  syncPolicy:
    syncOptions:
      - ServerSideApply=true
    automated:
      prune: true

I have read the "GCS fault tolerance in KubeRay" documentation. I think I need to set gcsFaultToleranceOptions, but I could not find how to set it in the Helm chart.

Assuming I have a highly available Redis cluster that can be accessed at redis.redis-namespace.svc:6379, how do I set the Ray head node in high availability mode using the Helm chart?
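
For reference, my understanding from the GCS fault tolerance documentation is that, on a plain RayCluster resource (not the Helm chart), the setting would look roughly like the fragment below. This is only a sketch: the metadata name is a placeholder, the rest of the spec is omitted, and I do not know whether or how the ray-cluster Helm chart passes this field through.

```yaml
# Sketch of a RayCluster custom resource, not Helm values.
# Only the field relevant to GCS fault tolerance is shown.
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  # Placeholder name, for illustration only
  name: hm-ray-cluster
spec:
  gcsFaultToleranceOptions:
    # External Redis that would store the GCS state
    redisAddress: redis.redis-namespace.svc:6379
  # headGroupSpec and workerGroupSpecs omitted
```

What I am looking for is the equivalent of this in the Helm chart values.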

I saw a similar question posted about 4 years ago, "High Availability for Head node of Ray clusters", but there was no solution at the time.

Any guidance would be appreciated. Thank you!