Some actors remain alive even after the job is finished or stopped

1. Severity of the issue: (select one)
Low: Annoying but doesn’t hinder my work.

2. Environment:

  • Ray version: 2.49.0
  • Python version: 3.10
  • OS: Ubuntu 24.04
  • Cloud/Infrastructure: AWS
  • Other libs/tools (if relevant):

3. What happened vs. what you expected:

  • Expected:
    • After a job finishes, or after it is stopped with the ray job stop command, the cluster should scale down the worker nodes
  • Actual:
    • After stopping a job with the ray job stop command, the worker nodes stay in the cluster, I think because some actors are still alive on them

I’m running a training job on EKS using KubeRay. Sometimes the workers spawned for the job are still alive even after the job is finished or stopped.

The following two actors are always alive whenever I hit the problem (a quick way to check for them is sketched below the list).

datasets_stats_actor
AutoscalingRequester
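
As a quick check before SSH-ing in, here is a minimal sketch that lists any of those two actors that are still ALIVE. It uses ray.util.state (the non-deprecated state API in recent Ray) and assumes it is run against the running cluster; the two names are just the ones shown above:

from ray.util.state import list_actors

# Query the cluster's state API for actors that are still ALIVE.
leftovers = [
    a for a in list_actors(filters=[("state", "=", "ALIVE")])
    if a.name in ("datasets_stats_actor", "AutoscalingRequester")
]
for a in leftovers:
    print(a.name, a.ray_namespace, a.node_id)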

I have to manually SSH into the machines and run the following script to kill the actors; after that the machines are released.

import ray
from ray.experimental.state.api import list_actors  # ray.util.state also exposes list_actors in recent Ray

# Attach to the cluster already running on this node instead of starting a new one.
ray.init(address="auto")

for actor in list_actors():
    # Only named actors can be looked up with ray.get_actor(); skip unnamed internal ones.
    if actor['state'] == 'ALIVE' and actor['name']:
        handle = ray.get_actor(actor['name'], actor['ray_namespace'])
        ray.kill(handle)
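
A narrower variant of the same workaround (just a sketch, based on the names above) would kill only those two actors instead of every ALIVE named actor, which feels safer if anything else legitimate is still running:

import ray
from ray.util.state import list_actors

ray.init(address="auto")  # attach to the already-running cluster on this node

for actor in list_actors(filters=[("state", "=", "ALIVE")]):
    # Only touch the two leftover actors mentioned above.
    if actor.name in ("datasets_stats_actor", "AutoscalingRequester"):
        ray.kill(ray.get_actor(actor.name, actor.ray_namespace))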

Does anyone know what these two actors do and why they stay alive on the GPU node?

Regards

Mitul