EC2 Autoscaler starts scaling down while scaling up

What happened:

  • I launched a job requiring a certain number of nodes.
  • The job finished.
  • I launched another job requiring a larger number of nodes of the same type.
  • The autoscaler started adding nodes to meet the higher demand, i.e. the difference between the newly required number of nodes and the number already running.
  • While adding the new nodes, the autoscaler also started removing the existing nodes from the previous run, and then immediately adding new nodes to replace the removed ones, which made the total time to scale up to the new demand longer.

What I would expect to happen:

  • While scaling up, the autoscaler should reuse the nodes from the previous run.

Does this look like a bug?

Autoscaler log indicating a representative sequence of events after the first job has completed and the second job has been submitted:

(autoscaler +12s) Adding 11 node(s) of type ray.worker.m6i.large.
(autoscaler +34s) Removing 5 nodes of type ray.worker.m6i.large (idle).
(autoscaler +40s) Adding 5 node(s) of type ray.worker.m6i.large.
(autoscaler +40s) Removing 4 nodes of type ray.worker.m6i.large (idle).
(autoscaler +46s) Adding 4 node(s) of type ray.worker.m6i.large.
(autoscaler +46s) Resized to 22 CPUs.
(autoscaler +46s) Removing 5 nodes of type ray.worker.m6i.large (idle).
(autoscaler +52s) Adding 5 node(s) of type ray.worker.m6i.large.
(autoscaler +52s) Resized to 14 CPUs.
(autoscaler +52s) Removing 1 nodes of type ray.worker.m6i.large (idle).
(autoscaler +57s) Adding 1 node(s) of type ray.worker.m6i.large.
(autoscaler +57s) Resized to 4 CPUs.
(autoscaler +1m2s) Resized to 2 CPUs.
...

I am experiencing a similar issue when training on EC2 spot instances and one of the nodes receives a spot interruption. It looks like the other nodes are discarded at that point, so eventually all of the nodes get replaced with new ones instead of just the interrupted one.

The following is an example log from training on 2 x g4dn.xlarge instances, at the moment one of the nodes experiences a spot interruption.

Note that the autoscaler reports removing the other node as “idle” and scales all the way down to just the head node (2 CPUs). A few seconds later, though, it starts adding back an identical node.

ray.exceptions.ActorUnavailableError: The actor 7f88f2e34a8cdfd3854261390a000000 is unavailable: The actor is temporarily unavailable: RpcError: RPC Error message: Socket closed; RPC Error details: . The task may or maynot have been executed on the actor.

Training errored after 40 iterations at 2025-02-04 00:19:52. Total running time: 40min 53s
Error file: /tmp/ray/session_2025-02-03_21-33-08_715945_68/artifacts/2025-02-03_23-38-58/experimental-2025-02-03_23-38-57_9340/driver_artifacts/TorchTrainer_0a4e2_00000_0_2025-02-03_23-38-58/error.txt
2025-02-04 00:19:52,926	WARNING insufficient_resources_manager.py:163 -- Ignore this message if the cluster is autoscaling. Training has not started in the last 60 seconds. This could be due to the cluster not having enough resources available. You asked for 3.0 CPUs and 2.0 GPUs, but the cluster only has 2.0 CPUs and 0 GPUs available. Stop the training and adjust the required resources (e.g. via the `ScalingConfig` or `resources_per_trial`, or `num_workers` for rllib), or add more resources to your cluster.
(pid=gcs_server) [2025-02-04 00:19:57,820 E 73 73] gcs_placement_group_scheduler.cc:247: Failed to cancel resource reserved for bundle because the max retry count is reached. placement group id={b7abf2686486f358f5445256f81c0a000000}, bundle index={0} at node 8ebf6d01807adf6d2c8de2b6c378e3ed5d86a9761b3b769529b3008c
(autoscaler +41m8s) Adding 1 node(s) of type ray.worker.g4dn.xlarge.
(raylet) The node with node id: 8ebf6d01807adf6d2c8de2b6c378e3ed5d86a9761b3b769529b3008c and address: 10.212.106.7 and node name: 10.212.106.7 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a 	(1) raylet crashes unexpectedly (OOM, etc.) 
	(2) raylet has lagging heartbeats due to slow network or busy workload.
(autoscaler +41m13s) Resized to 6 CPUs, 1 GPUs.
(autoscaler +41m39s) Removing 1 nodes of type ray.worker.g4dn.xlarge (idle).
(autoscaler +41m44s) Adding 1 node(s) of type ray.worker.g4dn.xlarge.
(autoscaler +41m50s) Resized to 2 CPUs.

I would prefer that the node that was not interrupted be kept, to reduce the time until training can resume.

This time I requested training with 20 nodes, and the autoscaler never even gets there! It keeps removing nodes while adding others.

This is the log. Note that the autoscaler keeps scaling up and down without any action on my part:

...
(autoscaler +14s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +14s) Adding 4 node(s) of type ray.worker.g4dn.xlarge.
2025-02-04 00:42:06,322	WARNING insufficient_resources_manager.py:163 -- Ignore this message if the cluster is autoscaling. Training has not started in the last 60 seconds. This could be due to the cluster not having enough resources available. You asked for 21.0 CPUs and 20.0 GPUs, but the cluster only has 2.0 CPUs and 0 GPUs available. Stop the training and adjust the required resources (e.g. via the `ScalingConfig` or `resources_per_trial`, or `num_workers` for rllib), or add more resources to your cluster.
2025-02-04 00:43:06,409	WARNING insufficient_resources_manager.py:163 -- Ignore this message if the cluster is autoscaling. Training has not started in the last 60 seconds. This could be due to the cluster not having enough resources available. You asked for 21.0 CPUs and 20.0 GPUs, but the cluster only has 2.0 CPUs and 0 GPUs available. Stop the training and adjust the required resources (e.g. via the `ScalingConfig` or `resources_per_trial`, or `num_workers` for rllib), or add more resources to your cluster.
2025-02-04 00:44:06,492	WARNING insufficient_resources_manager.py:163 -- Ignore this message if the cluster is autoscaling. Training has not started in the last 60 seconds. This could be due to the cluster not having enough resources available. You asked for 21.0 CPUs and 20.0 GPUs, but the cluster only has 2.0 CPUs and 0 GPUs available. Stop the training and adjust the required resources (e.g. via the `ScalingConfig` or `resources_per_trial`, or `num_workers` for rllib), or add more resources to your cluster.
(autoscaler +3m26s) Resized to 14 CPUs, 3 GPUs.
(autoscaler +3m32s) Resized to 18 CPUs, 4 GPUs.
(autoscaler +3m58s) Removing 4 nodes of type ray.worker.g4dn.xlarge (idle).
(autoscaler +4m3s) Adding 4 node(s) of type ray.worker.g4dn.xlarge.
2025-02-04 00:45:06,578	WARNING insufficient_resources_manager.py:163 -- Ignore this message if the cluster is autoscaling. Training has not started in the last 60 seconds. This could be due to the cluster not having enough resources available. You asked for 21.0 CPUs and 20.0 GPUs, but the cluster only has 2.0 CPUs and 0 GPUs available. Stop the training and adjust the required resources (e.g. via the `ScalingConfig` or `resources_per_trial`, or `num_workers` for rllib), or add more resources to your cluster.
(autoscaler +4m8s) Resized to 2 CPUs.
2025-02-04 00:46:06,661	WARNING insufficient_resources_manager.py:163 -- Ignore this message if the cluster is autoscaling. Training has not started in the last 60 seconds. This could be due to the cluster not having enough resources available. You asked for 21.0 CPUs and 20.0 GPUs, but the cluster only has 2.0 CPUs and 0 GPUs available. Stop the training and adjust the required resources (e.g. via the `ScalingConfig` or `resources_per_trial`, or `num_workers` for rllib), or add more resources to your cluster.
2025-02-04 00:47:06,745	WARNING insufficient_resources_manager.py:163 -- Ignore this message if the cluster is autoscaling. Training has not started in the last 60 seconds. This could be due to the cluster not having enough resources available. You asked for 21.0 CPUs and 20.0 GPUs, but the cluster only has 2.0 CPUs and 0 GPUs available. Stop the training and adjust the required resources (e.g. via the `ScalingConfig` or `resources_per_trial`, or `num_workers` for rllib), or add more resources to your cluster.
(autoscaler +6m43s) Resized to 6 CPUs, 1 GPUs.
(autoscaler +6m49s) Resized to 14 CPUs, 3 GPUs.
(autoscaler +6m54s) Resized to 18 CPUs, 4 GPUs.
2025-02-04 00:48:06,825	WARNING insufficient_resources_manager.py:163 -- Ignore this message if the cluster is autoscaling. Training has not started in the last 60 seconds. This could be due to the cluster not having enough resources available. You asked for 21.0 CPUs and 20.0 GPUs, but the cluster only has 2.0 CPUs and 0 GPUs available. Stop the training and adjust the required resources (e.g. via the `ScalingConfig` or `resources_per_trial`, or `num_workers` for rllib), or add more resources to your cluster.
(autoscaler +7m15s) Removing 1 nodes of type ray.worker.g4dn.xlarge (idle).
(autoscaler +7m20s) Adding 1 node(s) of type ray.worker.g4dn.xlarge.
(autoscaler +7m20s) Removing 2 nodes of type ray.worker.g4dn.xlarge (idle).
(autoscaler +7m26s) Adding 2 node(s) of type ray.worker.g4dn.xlarge.
(autoscaler +7m26s) Resized to 14 CPUs, 3 GPUs.
(autoscaler +7m26s) Removing 1 nodes of type ray.worker.g4dn.xlarge (idle).
(autoscaler +7m31s) Adding 1 node(s) of type ray.worker.g4dn.xlarge.
(autoscaler +7m31s) Resized to 6 CPUs, 1 GPUs.
(autoscaler +7m36s) Resized to 2 CPUs.

I wonder if a contributing factor to these issues is my very short idle_timeout_minutes: 0.5 setting in the cluster config file (i.e. nodes are reclaimed after only 30 seconds of idleness).

If that is the case, I would still consider this a bug in Ray. Especially for training that uses a placement group, I would expect that as long as the placement group is in place, no node holding one of its bundles should ever be considered idle for autoscaling purposes.
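
To illustrate what I mean by “in place”, here is a minimal sketch (my own example, not taken from my actual training setup): a placement group keeps its bundles reserved even when nothing is scheduled into them, which can be checked with ray.util.placement_group_table:

import ray
from ray.util.placement_group import placement_group, placement_group_table

ray.init()
# Reserve a single CPU bundle; no task or actor ever runs in it.
pg = placement_group([{"CPU": 1}])
ray.get(pg.ready())
# The bundle stays reserved (state "CREATED") until the group is removed,
# so I would expect the node holding it not to be reported as idle.
print(placement_group_table(pg)["state"])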

I’ve realized the issue with 20 nodes never being provisioned was due to having max_workers: 5 in the cluster config for that node type.

I have confirmed that with this program:

import ray
from ray.util.placement_group import placement_group

ray.init()
print("Creating placement group...")
# One bundle per node; "m6i.large.ondemand" is a custom resource advertised by
# workers of that node type in my cluster config.
pg = placement_group([{"m6i.large.ondemand": 1}] * 20)
# Wait up to 5 minutes for all 20 bundles to be placed.
created = pg.wait(timeout_seconds=5 * 60)
print(f"Finished. Created = {created}")

With max_workers: 5 it exhibits the problematic behaviour of scaling down and up and never finishing:

Creating placement group...
(autoscaler +8s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +8s) Adding 5 node(s) of type ray.worker.m6i.large.ondemand.
(autoscaler +2m28s) Resized to 6 CPUs.
(autoscaler +2m33s) Resized to 12 CPUs.
(autoscaler +3m0s) Removing 3 nodes of type ray.worker.m6i.large.ondemand (idle).
(autoscaler +3m5s) Adding 3 node(s) of type ray.worker.m6i.large.ondemand.
(autoscaler +3m5s) Removing 2 nodes of type ray.worker.m6i.large.ondemand (idle).
(autoscaler +3m10s) Adding 2 node(s) of type ray.worker.m6i.large.ondemand.
(autoscaler +3m10s) Resized to 6 CPUs.
(autoscaler +3m16s) Resized to 2 CPUs.
...

With max_workers: 50 it does complete though.

So, while this was an error on my part (I requested resources that the autoscaler's constraints can never satisfy), I would appreciate a more helpful error message saying that the requirements can never be met, rather than the autoscaler scaling down and up and reporting some nodes as idle.
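
In the meantime, the following is the kind of pre-flight check one could run before submitting (my own workaround sketch, not a Ray feature; the config path is hypothetical, and the node-type name and counts are the ones from this thread):

import yaml  # PyYAML

CLUSTER_CONFIG = "cluster.yaml"  # hypothetical path to the cluster launcher YAML
NODE_TYPE = "ray.worker.m6i.large.ondemand"
REQUESTED_NODES = 20  # one bundle per node in the placement group

with open(CLUSTER_CONFIG) as f:
    config = yaml.safe_load(f)

# The cluster-wide cap is the top-level max_workers key; each entry under
# available_node_types may additionally set its own max_workers.
cluster_cap = config["max_workers"]
type_cap = config["available_node_types"][NODE_TYPE].get("max_workers", cluster_cap)

if REQUESTED_NODES > min(cluster_cap, type_cap):
    raise SystemExit(
        f"Requested {REQUESTED_NODES} x {NODE_TYPE}, but the config allows at most "
        f"{min(cluster_cap, type_cap)}; the placement group can never be scheduled."
    )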

I would like to add that this still does not explain the unnecessary scaling down and up that I reported in the first two posts (e.g. when training on only 2 nodes and one of them is interrupted).

Thanks for providing the detailed information about the issue and your debugging process!

Regarding the question about scaling up and down, the issue is caused by the short idle_timeout_minutes; the (idle) at the end of each node-removal log line indicates the reason for the removal. It would help to increase idle_timeout_minutes in the cluster config to a larger value (the cluster launcher default is 5 minutes) to avoid the thrashing.

And I think your request for a clearer error message makes sense. Could you file a GitHub issue with this request so we can track it? Thanks!