Some actors remain alive even after a job is finished or stopped

1. Severity of the issue: (select one)
Low: Annoying but doesn’t hinder my work.

2. Environment:

  • Ray version: 2.49.0
  • Python version: 3.10
  • OS: Ubuntu 24.04
  • Cloud/Infrastructure: AWS
  • Other libs/tools (if relevant):

3. What happened vs. what you expected:

  • Expected:
    • After a job finishes, or after stopping a job with the ray job stop command, the cluster should scale down its workers
  • Actual:
    • After stopping a job with the ray job stop command, worker nodes in the cluster stay up, which I believe is because some actors are still alive.

I’m running a training job on EKS using KubeRay. Sometimes the spawned workers stay alive even after a job is finished or stopped.

I’ve noticed that the following two actors are always alive whenever the problem occurs.

datasets_stats_actor
AutoscalingRequester
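If you want to confirm which actors are still around before killing anything, the Ray State CLI can list them from the head node. A minimal sketch, assuming a recent Ray release where the state CLI and its `key=value` filter syntax are available:

```shell
# List all actors the cluster still considers ALIVE
ray list actors --filter "state=ALIVE"
```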

I have to manually SSH into the machines and run the following script to kill the actors. After that, the machines are released.

import ray
# Note: ray.experimental.state.api is deprecated; ray.util.state is the
# supported location for the state API in recent Ray releases.
from ray.util.state import list_actors

ray.init(address="auto")  # attach to the running cluster

for actor in list_actors():
    if actor.state == "ALIVE" and actor.name:
        handle = ray.get_actor(actor.name, namespace=actor.ray_namespace)
        ray.kill(handle)

Does anyone know what these two actors do and why they stay alive on the GPU node?

Regards

Mitul

I had a similar issue; it seems to be related to GCS outages and to trying to schedule too many actors at once. I ended up increasing the timeouts for a number of the health checks, and it helped a lot.


Yes, GCS (Global Control Store) outages or high load can cause problems when many actors are scheduled at once, leading to timeouts or failures. Increasing health-check and reconnect timeouts is a common mitigation and is supported by Ray’s internal configuration parameters such as gcs_rpc_server_reconnect_timeout_s, health_check_period_ms, and health_check_failure_threshold. These can be set via Ray’s system config to make the cluster more tolerant of transient GCS unavailability or heavy scheduling load.

This approach is documented in Ray’s test suites and discussed in community threads as an effective workaround, especially for large or bursty actor workloads. For example, setting gcs_rpc_server_reconnect_timeout_s to a higher value allows more time for the GCS to recover before workers are marked dead or jobs fail, and increasing health_check_period_ms or health_check_failure_threshold reduces false positives in node-failure detection during GCS hiccups.

Would you like more detail on which specific parameters to tune and how to set them?
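A sketch of how these timeouts can be raised at cluster startup. Ray’s internal system-config values can be supplied as `RAY_`-prefixed environment variables on the head (and worker) containers before `ray start` runs; on KubeRay this is typically done via the pod spec’s `env` section. The specific values below are illustrative assumptions, not recommendations:

```shell
# Illustrative values only -- tune for your cluster.
# Set on the Ray head (and worker) containers before `ray start`.
export RAY_gcs_rpc_server_reconnect_timeout_s=300   # allow more time for GCS recovery
export RAY_health_check_period_ms=30000             # probe node health less often
export RAY_health_check_failure_threshold=10        # tolerate more missed probes
ray start --head
```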
