1. Severity of the issue:
Medium: Significantly affects my productivity, but I can find a workaround.
2. Environment:
- Ray version: 2.48.0
- Python version: 3.12.9
- OS: Ubuntu (from the standard KubeRay worker image rayproject/2.48.0-py312-aarch64)
- Cloud/Infrastructure: Kubernetes cluster
3. What happened vs. what you expected:
I’m running Ray Data batch jobs on a KubeRay cluster, and I’m having trouble getting the cluster to scale down to zero workers after a job completes.
- Expected: after the Ray Data job finishes and the autoscaler's idle timeout period (`idleTimeoutSeconds`) passes, all Ray worker pods should be terminated, scaling the cluster down to zero active workers.
- Actual: after the job finishes, one worker pod consistently remains running indefinitely and is never terminated. Investigation (see the check sketched below) shows this is because the internal `datasets_stats_actor` remains alive on that node. The presence of this actor prevents the KubeRay autoscaler from considering the node idle, which blocks the final scale-down.
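
For reference, here is how I checked which actor is keeping the node busy, using the Ray state API. This is a minimal sketch: `address="auto"` assumes it runs from inside the cluster, and the printed fields are just the ones I found useful.

```python
import ray
from ray.util.state import list_actors

ray.init(address="auto")

# List actors that are still ALIVE after the Ray Data job has finished.
# The lingering datasets_stats_actor shows up here, scheduled on the
# worker node that never scales down.
for actor in list_actors(filters=[("state", "=", "ALIVE")]):
    print(actor.name, actor.class_name, actor.node_id)
```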
I tried to disable this actor as follows, but it has no effect:
```python
from ray.data import DataContext

ctx = DataContext.get_current()
ctx.enable_auto_log_stats = False
```
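
As far as I can tell, `enable_auto_log_stats` only controls whether stats are logged automatically after execution; the stats actor itself is still created. The only workaround I have found so far is to kill the named actor explicitly once the job is done, roughly as below. The namespace string is an assumption on my part and may differ between Ray versions, and I would much rather use a supported cleanup mechanism.

```python
import ray

ray.init(address="auto")

try:
    # Actor name matches what `ray list actors` reports on my cluster;
    # the namespace is an assumption and may differ between Ray versions.
    stats_actor = ray.get_actor("datasets_stats_actor", namespace="_dataset_stats_actor")
    ray.kill(stats_actor)
except ValueError:
    # Actor not found -- nothing to clean up, which keeps this idempotent.
    pass
```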
My Questions:
- What is the recommended, idempotent way to ensure a Ray Data job on KubeRay cleans up all of its resources, including the `datasets_stats_actor`, so the cluster can scale down to zero?
- Is there a way to have `datasets_stats_actor` created on the head node instead of a worker node? For actors I create myself I can pin them to the head node as sketched below; is there an equivalent option for this internal actor?
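
For context on the second question, this is roughly how I would pin one of my own actors to the head node. It assumes the driver runs on the head node (so the driver's node ID is the head node's ID), and `MyActor` is just a placeholder.

```python
import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

ray.init(address="auto")

@ray.remote
class MyActor:
    def ping(self):
        return "pong"

# Assumes this driver script runs on the head node, so its node ID
# is the head node's ID.
head_node_id = ray.get_runtime_context().get_node_id()

actor = MyActor.options(
    scheduling_strategy=NodeAffinitySchedulingStrategy(node_id=head_node_id, soft=False)
).remote()
```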
