How do Ray actors work?

We have a Ray cluster running, but today, for about 1.5 hours, the cluster was up, yet no Ray actors were running. During this time, users were unable to submit any Ray jobs.

Here is a simplified version of our deployment code:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: hm-ray-cluster
  namespace: production-hm-ray-cluster
  labels:
    app.kubernetes.io/name: hm-ray-cluster-deployment
    app.kubernetes.io/part-of: production-hm-ray-cluster
spec:
  rayVersion: 2.43.0
  # https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.external-redis.yaml
  gcsFaultToleranceOptions:
    redisAddress: redis://hm-ray-cluster-valkey-primary.production-hm-ray-cluster-valkey.svc:6379
    redisPassword:
      valueFrom:
        secretKeyRef:
          name: hm-ray-cluster-secret
          key: VALKEY_PASSWORD
  headGroupSpec:
    rayStartParams:
      num-cpus: "0"
    template:
      spec:
        serviceAccountName: hm-ray-cluster-service-account
        # https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.autoscaler-v2.yaml
        restartPolicy: Never
        containers:
          - name: ray-head
            image: rayproject/ray:2.43.0-py312-cpu
            ports:
              - containerPort: 6379
                name: gcs
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
              - containerPort: 8000
                name: serve
            resources:
              requests:
                cpu: 1000m
                memory: 2Gi
              limits:
                cpu: 2000m
                memory: 4Gi
  workerGroupSpecs:
    - groupName: group-1
      replicas: 1
      minReplicas: 1
      maxReplicas: 100
      rayStartParams: {}
      template:
        spec:
          serviceAccountName: hm-ray-cluster-service-account
          # https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.autoscaler-v2.yaml
          restartPolicy: Never
          containers:
            - name: ray-worker
              image: rayproject/ray:2.43.0-py312-cpu
              resources:
                requests:
                  cpu: 15000m
                  memory: 60Gi
                limits:
                  cpu: 15000m
                  memory: 60Gi

I am wondering how Ray actors work, specifically the datasets_stats_actor (which uses the _StatsActor class). Can multiple Ray actors run simultaneously in a single Ray cluster, similar to a high-availability mode? Thanks!


I'm facing a similar problem: datasets_stats_actor and AutoscalingRequester keep running even after the training job has finished.

Strangely, these two actors are placed on a GPU node, which keeps that expensive GPU node alive.

I have to SSH into the GPU node and run the following script to kill these actors. Note that it kills every alive named actor in the cluster, not just these two.

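import ray
from ray.util.state import list_actors  # ray.experimental.state.api in older Ray releases

ray.init(address="auto")  # connect to the already running cluster

for actor in list_actors(filters=[("state", "=", "ALIVE")]):
    if not actor.name:
        continue  # ray.get_actor can only look up named actors
    handle = ray.get_actor(actor.name, namespace=actor.ray_namespace)
    ray.kill(handle)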
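
A narrower variant of the script above might kill only the two actors mentioned in this post instead of every alive actor. The following is a minimal sketch under a few assumptions: the helper names `select_actors_to_kill` and `kill_leftover_actors` are hypothetical, the actors are assumed to be identifiable by the substrings `datasets_stats_actor` and `AutoscalingRequester` in their name or class name, and the state API is assumed to be importable from `ray.util.state` (recent Ray releases). The selection logic is kept separate from the Ray calls so it can be checked without a cluster.

```python
# Assumption: these substrings identify the two actors named in the post.
TARGET_SUBSTRINGS = ("datasets_stats_actor", "AutoscalingRequester")


def select_actors_to_kill(records, targets=TARGET_SUBSTRINGS):
    """Return (name, namespace) pairs for alive, named actors matching targets.

    `records` is an iterable of plain dicts with the keys
    "state", "name", "class_name", and "ray_namespace".
    """
    selected = []
    for rec in records:
        if rec.get("state") != "ALIVE":
            continue
        name = rec.get("name") or ""
        class_name = rec.get("class_name") or ""
        if not name:
            # ray.get_actor can only look up named actors, so skip unnamed ones.
            continue
        if any(t in name or t in class_name for t in targets):
            selected.append((name, rec.get("ray_namespace")))
    return selected


def kill_leftover_actors():
    """Hypothetical entry point: run on a node of the running cluster."""
    # Imported lazily so the selection helper above works without Ray installed.
    import ray
    from ray.util.state import list_actors

    ray.init(address="auto", ignore_reinit_error=True)
    records = [
        {
            "state": a.state,
            "name": a.name,
            "class_name": a.class_name,
            "ray_namespace": a.ray_namespace,
        }
        for a in list_actors()
    ]
    for name, namespace in select_actors_to_kill(records):
        ray.kill(ray.get_actor(name, namespace=namespace))
```

Calling `kill_leftover_actors()` from a Python shell on a cluster node would then leave unrelated actors untouched. Whether killing these actors is safe while other jobs are running is not something this sketch checks.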