We have a Ray cluster running. Today, for about 1.5 hours, the cluster itself was up but no Ray actors were running, and during that time users were unable to submit any Ray jobs.
Here is a simplified version of our deployment code:
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: hm-ray-cluster
  namespace: production-hm-ray-cluster
  labels:
    app.kubernetes.io/name: hm-ray-cluster-deployment
    app.kubernetes.io/part-of: production-hm-ray-cluster
spec:
  rayVersion: 2.43.0
  # https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.external-redis.yaml
  gcsFaultToleranceOptions:
    redisAddress: redis://hm-ray-cluster-valkey-primary.production-hm-ray-cluster-valkey.svc:6379
    redisPassword:
      valueFrom:
        secretKeyRef:
          name: hm-ray-cluster-secret
          key: VALKEY_PASSWORD
  headGroupSpec:
    rayStartParams:
      num-cpus: "0"
    template:
      spec:
        serviceAccountName: hm-ray-cluster-service-account
        # https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.autoscaler-v2.yaml
        restartPolicy: Never
        containers:
          - name: ray-head
            image: rayproject/ray:2.43.0-py312-cpu
            ports:
              - containerPort: 6379
                name: gcs
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
              - containerPort: 8000
                name: serve
            resources:
              requests:
                cpu: 1000m
                memory: 2Gi
              limits:
                cpu: 2000m
                memory: 4Gi
  workerGroupSpecs:
    - groupName: group-1
      replicas: 1
      minReplicas: 1
      maxReplicas: 100
      rayStartParams: {}
      template:
        spec:
          serviceAccountName: hm-ray-cluster-service-account
          # https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.autoscaler-v2.yaml
          restartPolicy: Never
          containers:
            - name: ray-worker
              image: rayproject/ray:2.43.0-py312-cpu
              resources:
                requests:
                  cpu: 15000m
                  memory: 60Gi
                limits:
                  cpu: 15000m
                  memory: 60Gi
I am wondering how Ray actors work, specifically the datasets_stats_actor (which uses the _StatsActor class). Can multiple Ray actors run simultaneously in a single Ray cluster, similar to a high-availability mode? Thanks!
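For reference, this is roughly how the "no actors running" state could be checked from a client. It is only a minimal sketch: it assumes the Ray state API (`ray.util.state.list_actors`) is reachable and that the script runs on a node of the cluster. The helper names `count_alive_actors_by_class` and `fetch_actors` are hypothetical and not part of Ray.

```python
from collections import Counter


def count_alive_actors_by_class(actors):
    """Count actors whose state is ALIVE, grouped by class name.

    `actors` is any iterable of records exposing `class_name` and `state`
    attributes, such as the entries returned by the Ray state API.
    """
    return Counter(a.class_name for a in actors if a.state == "ALIVE")


def fetch_actors(limit=1000):
    """Fetch actor records from the cluster (hypothetical helper)."""
    # Imported lazily so the counting helper works without Ray installed.
    import ray
    from ray.util.state import list_actors

    # Assumption: this process runs on a cluster node, so "auto" resolves.
    ray.init(address="auto")
    return list_actors(limit=limit)
```

Calling `count_alive_actors_by_class(fetch_actors())` during the incident window would be expected to return an empty counter if no actors were alive. Otherwise it would show how many instances of each class, for example `_StatsActor`, were alive at that moment.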