1. Severity of the issue: Low (annoying but doesn't hinder my work).
2. Environment:
- Ray version: 2.49.0
- Python version: 3.10
- OS: Ubuntu 24.04
- Cloud/Infrastructure: AWS
- Other libs/tools (if relevant):
3. What happened vs. what you expected:
- Expected:
  - After a job finishes, or after stopping a job with ray job stop, the cluster should scale the workers back down.
- Actual:
  - After stopping a job with ray job stop, the workers in the cluster stay alive; I believe this is because some actors on them are still alive (the equivalent SDK call is sketched below).
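For reference, the ray job stop CLI maps to the Job Submission SDK; a minimal sketch of the same stop call (the dashboard address and submission ID below are placeholders, not my real values):

from ray.job_submission import JobSubmissionClient

# Placeholder dashboard address and submission ID, for illustration only.
client = JobSubmissionClient("http://127.0.0.1:8265")
client.stop_job("raysubmit_XXXXXXXXXXXXXXXX")

# Once the job reports STOPPED (or SUCCEEDED), I expect the autoscaler
# to remove the now-idle worker nodes.
print(client.get_job_status("raysubmit_XXXXXXXXXXXXXXXX"))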
I'm running a training job on EKS using KubeRay. Sometimes the spawned workers stay alive even after the job has finished or been stopped.
Whenever I've seen the problem, the following two actors are always still alive:
- datasets_stats_actor
- AutoscalingRequester
I have to manually SSH into the machines and run the following script to kill the actors; after that, the machines are released.
import ray
from ray.experimental.state.api import list_actors

# Connect to the existing cluster on this node.
ray.init(address="auto")

# Kill every named actor that is still ALIVE so the node can be released.
for actor in list_actors():
    if actor['state'] == 'ALIVE':
        handle = ray.get_actor(actor['name'], namespace=actor['ray_namespace'])
        ray.kill(handle)
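For what it's worth, ray.experimental.state.api has been superseded by ray.util.state in recent Ray releases; the same cleanup against that API would look roughly like this (a sketch, assuming the ActorState fields name and ray_namespace match what ray list actors --detail shows):

import ray
from ray.util.state import list_actors

# Connect to the running cluster on this node.
ray.init(address="auto")

# detail=True so ray_namespace is populated; only look at ALIVE actors.
for actor in list_actors(filters=[("state", "=", "ALIVE")], detail=True):
    if actor.name:  # anonymous actors have no name to look up
        handle = ray.get_actor(actor.name, namespace=actor.ray_namespace)
        ray.kill(handle)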
Does anyone know what these two actors do and why they stay alive on the GPU nodes?
Regards
Mitul