1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.
2. Environment:
Ray version: 2.51
Python version: 3.11
OS: Linux
Cloud/Infrastructure: Kubernetes
Other libs/tools (if relevant):
3. What happened vs. what you expected:
I am running a training job on an autoscaling cluster. The training itself doesn't require many resources, but the autoscaler keeps requesting more and more CPUs. As a result, a lot of workers are created and then sit idle.
Actual: many pre-processing workers are created and idle for no clear reason
Expected: no autoscaling at all
I am mostly interested in understanding why the autoscaler keeps trying to bring more CPUs online despite there being no real need.
It keeps growing the CPU request over time:
Resized to 1084 CPUs, 1 GPUs.
Resized to 1212 CPUs, 1 GPUs.
... etc
This is a known issue with Ray Data and the autoscaler: Ray Data can trigger excessive autoscaling, so the cluster keeps adding CPU nodes even when they are not needed and ends up with many idle workers. It happens because Ray Data's autoscaling logic can launch a large number of tasks (for example during repartition or map operations) and request CPUs aggressively, without always applying backpressure or limiting the number of concurrent tasks, which leads to over-provisioning. The autoscaler reacts to these resource requests rather than to actual CPU utilization, so it keeps scaling up as long as outstanding requests exist, even if the tasks are not using the CPUs efficiently or have already completed. This behavior is documented in several open issues and discussions, and there is currently no complete fix; the main workarounds are to cap the maximum number of CPU nodes in your cluster configuration or to adjust your data pipeline so it does not trigger excessive parallelism, e.g. by limiting partitions or concurrency (Ray Data/LLM Non stop CPU autoscaling with vLLM, Ray Data scaleup unlimited resources when processing data, Ray keeps adding nodes beyond Dataset.map concurrency).
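For the pipeline-side workaround, limiting the concurrency of the pre-processing step keeps Ray Data from queuing up far more tasks than the training loop can consume, which in turn limits what the autoscaler sees as demand. A minimal sketch, assuming a `map_batches` pre-processing step; the `preprocess` function, dataset, batch size, and concurrency value below are placeholders, and the `concurrency` argument assumes a recent Ray release where `map_batches` accepts it:

```python
import ray

ray.init()  # or ray.init(address="auto") when running on the cluster

# Hypothetical pre-processing UDF; replace with your real transform.
def preprocess(batch):
    batch["id"] = batch["id"] * 2  # placeholder work
    return batch

# Placeholder dataset; substitute your real read_* call.
ds = ray.data.range(100_000)

# Cap the number of concurrent map tasks so Ray Data does not queue up far
# more work than the training step consumes. With at most 8 concurrent
# 1-CPU tasks, the autoscaler sees a demand of roughly 8 CPUs for this
# stage instead of an ever-growing request.
ds = ds.map_batches(
    preprocess,
    batch_size=1024,  # placeholder; tune to your data
    num_cpus=1,       # CPUs requested per task
    concurrency=8,    # max concurrent tasks for this operator
)

ds.take(1)  # trigger execution on a small sample
```

The key point is that the autoscaler only ever sees the per-operator demand you allow, so pick a concurrency that matches what the trainer can actually ingest.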
The autoscaler does not scale based on actual CPU usage; it scales based on the resource requests generated by pending Ray tasks and actors. If your pipeline or Ray Data job creates many tasks or partitions, the autoscaler interprets this as demand for more CPUs, regardless of real utilization. To prevent this, set a hard limit on the number of worker nodes or CPUs in your Ray cluster config, and consider tuning your data pipeline to reduce the number of concurrent tasks or blocks (Ray Cluster Key Concepts). Would you like more detail on how to adjust your pipeline or cluster config to mitigate this?
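If you prefer a global cap over tuning every operator, you can bound the cluster itself (on Kubernetes, via maxReplicas on the worker group in the KubeRay RayCluster spec) or bound what Ray Data will request through its execution options. A hedged sketch of the latter, assuming the ExecutionResources / DataContext API exposed in recent Ray releases (attribute names and the chosen limits are illustrative; check the docs for the version you run):

```python
import ray
from ray.data import ExecutionResources

ray.init()

# Cap the total resources Ray Data's streaming executor will try to use.
# With a CPU budget in place, Ray Data stops emitting resource requests
# beyond it, so the autoscaler has no growing demand to react to.
ctx = ray.data.DataContext.get_current()
ctx.execution_options.resource_limits = ExecutionResources(
    cpu=32,                                # placeholder CPU budget for the whole pipeline
    object_store_memory=20 * 1024**3,      # optional memory cap, also a placeholder
)

# Any Ray Data pipeline built after this point runs under the cap.
ds = ray.data.range(100_000).map_batches(lambda b: b, batch_size=1024)
ds.take(1)
```

Combining a cluster-level ceiling (maxReplicas) with a pipeline-level budget like this is usually enough to stop the "Resized to ... CPUs" growth you are seeing.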