How do Ray actors work?

Hongbo-Miao · April 2, 2025, 9:38pm

We run a Ray cluster on Kubernetes. Today, for about 1.5 hours, the cluster itself was up but no Ray actors were running, and users could not submit any Ray jobs during that window.

Here is a simplified version of our deployment code:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: hm-ray-cluster
  namespace: production-hm-ray-cluster
  labels:
    app.kubernetes.io/name: hm-ray-cluster-deployment
    app.kubernetes.io/part-of: production-hm-ray-cluster
spec:
  rayVersion: 2.43.0
  # https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.external-redis.yaml
  gcsFaultToleranceOptions:
    redisAddress: redis://hm-ray-cluster-valkey-primary.production-hm-ray-cluster-valkey.svc:6379
    redisPassword:
      valueFrom:
        secretKeyRef:
          name: hm-ray-cluster-secret
          key: VALKEY_PASSWORD
  headGroupSpec:
    rayStartParams:
      num-cpus: "0"
    template:
      spec:
        serviceAccountName: hm-ray-cluster-service-account
        # https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.autoscaler-v2.yaml
        restartPolicy: Never
        containers:
          - name: ray-head
            image: rayproject/ray:2.43.0-py312-cpu
            ports:
              - containerPort: 6379
                name: gcs
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
              - containerPort: 8000
                name: serve
            resources:
              requests:
                cpu: 1000m
                memory: 2Gi
              limits:
                cpu: 2000m
                memory: 4Gi
  workerGroupSpecs:
    - groupName: group-1
      replicas: 1
      minReplicas: 1
      maxReplicas: 100
      rayStartParams: {}
      template:
        spec:
          serviceAccountName: hm-ray-cluster-service-account
          # https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.autoscaler-v2.yaml
          restartPolicy: Never
          containers:
            - name: ray-worker
              image: rayproject/ray:2.43.0-py312-cpu
              resources:
                requests:
                  cpu: 15000m
                  memory: 60Gi
                limits:
                  cpu: 15000m
                  memory: 60Gi

I would like to understand how Ray actors work, specifically the datasets_stats_actor (which uses the _StatsActor class). Can multiple Ray actors run simultaneously in a single Ray cluster, similar to a high-availability mode, so that one can take over if another stops? Thanks!
