This time, I requested training with 20 nodes, and the autoscaler is never even able to get there! It keeps removing nodes while adding others.
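For reference, here is a minimal sketch of the kind of resource request involved (a simplified, assumed example, not my exact training script):

```python
# Rough sketch (assumed): a Ray Train TorchTrainer with 20 GPU workers.
# With the default 1 CPU + 1 GPU per worker, plus 1 CPU for the trainer
# itself, this asks for 21 CPUs and 20 GPUs, matching the warning below.
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker():
    ...  # actual training code omitted

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=20, use_gpu=True),
)
trainer.fit()
```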
Here is the log. Note that the autoscaler keeps scaling up and down, without any action on my part:
...
(autoscaler +14s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +14s) Adding 4 node(s) of type ray.worker.g4dn.xlarge.
2025-02-04 00:42:06,322 WARNING insufficient_resources_manager.py:163 -- Ignore this message if the cluster is autoscaling. Training has not started in the last 60 seconds. This could be due to the cluster not having enough resources available. You asked for 21.0 CPUs and 20.0 GPUs, but the cluster only has 2.0 CPUs and 0 GPUs available. Stop the training and adjust the required resources (e.g. via the `ScalingConfig` or `resources_per_trial`, or `num_workers` for rllib), or add more resources to your cluster.
2025-02-04 00:43:06,409 WARNING insufficient_resources_manager.py:163 -- Ignore this message if the cluster is autoscaling. Training has not started in the last 60 seconds. This could be due to the cluster not having enough resources available. You asked for 21.0 CPUs and 20.0 GPUs, but the cluster only has 2.0 CPUs and 0 GPUs available. Stop the training and adjust the required resources (e.g. via the `ScalingConfig` or `resources_per_trial`, or `num_workers` for rllib), or add more resources to your cluster.
2025-02-04 00:44:06,492 WARNING insufficient_resources_manager.py:163 -- Ignore this message if the cluster is autoscaling. Training has not started in the last 60 seconds. This could be due to the cluster not having enough resources available. You asked for 21.0 CPUs and 20.0 GPUs, but the cluster only has 2.0 CPUs and 0 GPUs available. Stop the training and adjust the required resources (e.g. via the `ScalingConfig` or `resources_per_trial`, or `num_workers` for rllib), or add more resources to your cluster.
(autoscaler +3m26s) Resized to 14 CPUs, 3 GPUs.
(autoscaler +3m32s) Resized to 18 CPUs, 4 GPUs.
(autoscaler +3m58s) Removing 4 nodes of type ray.worker.g4dn.xlarge (idle).
(autoscaler +4m3s) Adding 4 node(s) of type ray.worker.g4dn.xlarge.
2025-02-04 00:45:06,578 WARNING insufficient_resources_manager.py:163 -- Ignore this message if the cluster is autoscaling. Training has not started in the last 60 seconds. This could be due to the cluster not having enough resources available. You asked for 21.0 CPUs and 20.0 GPUs, but the cluster only has 2.0 CPUs and 0 GPUs available. Stop the training and adjust the required resources (e.g. via the `ScalingConfig` or `resources_per_trial`, or `num_workers` for rllib), or add more resources to your cluster.
(autoscaler +4m8s) Resized to 2 CPUs.
2025-02-04 00:46:06,661 WARNING insufficient_resources_manager.py:163 -- Ignore this message if the cluster is autoscaling. Training has not started in the last 60 seconds. This could be due to the cluster not having enough resources available. You asked for 21.0 CPUs and 20.0 GPUs, but the cluster only has 2.0 CPUs and 0 GPUs available. Stop the training and adjust the required resources (e.g. via the `ScalingConfig` or `resources_per_trial`, or `num_workers` for rllib), or add more resources to your cluster.
2025-02-04 00:47:06,745 WARNING insufficient_resources_manager.py:163 -- Ignore this message if the cluster is autoscaling. Training has not started in the last 60 seconds. This could be due to the cluster not having enough resources available. You asked for 21.0 CPUs and 20.0 GPUs, but the cluster only has 2.0 CPUs and 0 GPUs available. Stop the training and adjust the required resources (e.g. via the `ScalingConfig` or `resources_per_trial`, or `num_workers` for rllib), or add more resources to your cluster.
(autoscaler +6m43s) Resized to 6 CPUs, 1 GPUs.
(autoscaler +6m49s) Resized to 14 CPUs, 3 GPUs.
(autoscaler +6m54s) Resized to 18 CPUs, 4 GPUs.
2025-02-04 00:48:06,825 WARNING insufficient_resources_manager.py:163 -- Ignore this message if the cluster is autoscaling. Training has not started in the last 60 seconds. This could be due to the cluster not having enough resources available. You asked for 21.0 CPUs and 20.0 GPUs, but the cluster only has 2.0 CPUs and 0 GPUs available. Stop the training and adjust the required resources (e.g. via the `ScalingConfig` or `resources_per_trial`, or `num_workers` for rllib), or add more resources to your cluster.
(autoscaler +7m15s) Removing 1 nodes of type ray.worker.g4dn.xlarge (idle).
(autoscaler +7m20s) Adding 1 node(s) of type ray.worker.g4dn.xlarge.
(autoscaler +7m20s) Removing 2 nodes of type ray.worker.g4dn.xlarge (idle).
(autoscaler +7m26s) Adding 2 node(s) of type ray.worker.g4dn.xlarge.
(autoscaler +7m26s) Resized to 14 CPUs, 3 GPUs.
(autoscaler +7m26s) Removing 1 nodes of type ray.worker.g4dn.xlarge (idle).
(autoscaler +7m31s) Adding 1 node(s) of type ray.worker.g4dn.xlarge.
(autoscaler +7m31s) Resized to 6 CPUs, 1 GPUs.
(autoscaler +7m36s) Resized to 2 CPUs.