Originally asked at How to set Ray head node in high availability mode using KubeRay Helm chart? - Stack Overflow
Here is a copy:
I am trying to set up high availability (HA) for Ray head node. Currently, if Ray head node is down, the Ray job running in this Ray cluster will fail and disappear.
To clarify, I am not using Ray Serve. I am only running some Ray jobs in a Ray cluster.
I deployed my Ray cluster by this KubeRay Helm chart.
Here is my deployment code:
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: production-hm-ray-cluster
namespace: production-hm-argo-cd
labels:
app.kubernetes.io/name: hm-ray-cluster
spec:
project: production-hm
source:
repoURL: https://ray-project.github.io/kuberay-helm
# https://github.com/ray-project/kuberay/releases
targetRevision: 1.3.0
chart: ray-cluster
helm:
releaseName: hm-ray-cluster
values: |
# https://github.com/ray-project/kuberay/blob/master/helm-chart/ray-cluster/values.yaml
---
image:
tag: 2.41.0-py312-cpu
head:
serviceAccountName: hm-ray-cluster-service-account
autoscalerOptions:
upscalingMode: Default
# Seconds
idleTimeoutSeconds: 300
resources:
requests:
cpu: 1000m
memory: 8Gi
limits:
cpu: 4000m
memory: 128Gi
worker:
replicas: 10
minReplicas: 10
maxReplicas: 100
serviceAccountName: hm-ray-cluster-service-account
resources:
requests:
cpu: 1000m
memory: 8Gi
limits:
cpu: 4000m
memory: 128Gi
destination:
namespace: production-hm-ray-cluster
server: https://kubernetes.default.svc
syncPolicy:
syncOptions:
- ServerSideApply=true
automated:
prune: true
I have read GCS fault tolerance in KubeRay. I feel I need set gcsFaultToleranceOptions
, however, I didn’t find how to set it in Helm chart.
Assuming I have a high availability Redis cluster and can be accessed by redis.redis-namespace.svc:6379
, how to set Ray head node in high availability mode using Helm chart?
I saw a similar question posted about 4 years ago at High Availability for Head node of Ray clusters, but there was no solution at the time.
Any guide would be appreciate. Thank you!