Moving on to next available node type when AWS spot capacity unavailable?

matthew.cox · November 17, 2021, 4:19pm

I have several worker types specified under available_node_types, some of which are designated as spot instances. When there is no spot availability for the instance type I’m requesting, the autoscaler keeps retrying to acquire that same node type. It appears to try different availability zones, but it never moves on to the next node type, so my cluster doesn’t scale. Is there something I can do to make the autoscaler move on to the next node type? Below are the relevant portions of monitor.log:

botocore.exceptions.ClientError: An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient p3.2xlarge capacity in the Availability Zone you requested (us-east-1b). Our system will be working on provisioning additional capacity. You can currently get p3.2xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1d, us-east-1f.
2021-11-17 16:06:08,853	WARNING resource_demand_scheduler.py:730 -- The autoscaler could not find a node type to satisfy the request: [{'bundle_group_e71cadf899de8ecb8ca3cefdf62dfa97': 0.001}]. If this request is related to placement groups the resource request will resolve itself, otherwise please specify a node type with the necessary resource https://docs.ray.io/en/master/cluster/autoscaling.html#multiple-node-type-autoscaling.
2021-11-17 16:06:08,853	INFO autoscaler.py:990 -- StandardAutoscaler: Queue 1 new nodes for launch
2021-11-17 16:06:08,855	INFO node_launcher.py:99 -- NodeLauncher1: Got 1 nodes to launch.
2021-11-17 16:06:08,921	INFO node_launcher.py:99 -- NodeLauncher1: Launching 1 nodes, type ray.worker.p3.small.
2021-11-17 16:06:08,957	INFO autoscaler.py:267 --
======== Autoscaler status: 2021-11-17 16:06:08.957415 ========
Node status
---------------------------------------------------------------
Healthy:
 1 ray.worker.g5.base
 1 ray.head.default
Pending:
 ray.worker.p3.small, 1 launching
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 2.0/8.0 CPU (2.0 used of 2.0 reserved in placement groups)
 2.0/2.0 GPU (2.0 used of 2.0 reserved in placement groups)
 0.0/2.0 accelerator_type:A10G
 0.00/20.462 GiB memory
 0.00/9.254 GiB object_store_memory

Demands:
 {'CPU': 1.0, 'GPU': 1.0} * 1 (PACK): 1+ pending placement groups
2021-11-17 16:06:09,008	INFO monitor.py:328 -- :event_summary:Adding 1 nodes of type ray.worker.p3.small.
2021-11-17 16:06:13,871	ERROR node_launcher.py:92 -- Launch failed
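
To make the behavior being asked about concrete, here is a minimal Python sketch of "try every zone for one node type, then move on to the next type". It is not the Ray autoscaler's actual logic and uses no Ray or AWS APIs: `CapacityUnavailable`, `launch_with_fallback`, and the `launch` callback are hypothetical names introduced only for illustration, with `CapacityUnavailable` standing in for the InsufficientInstanceCapacity error in the log above.

```python
from typing import Callable, Optional, Sequence


class CapacityUnavailable(Exception):
    """Hypothetical stand-in for an InsufficientInstanceCapacity error."""


def launch_with_fallback(
    node_types: Sequence[str],
    zones: Sequence[str],
    launch: Callable[[str, str], str],
) -> Optional[str]:
    """Try each zone for a node type, then fall through to the next type.

    `launch` is a caller-supplied function (an assumption of this sketch)
    that either returns a node identifier or raises CapacityUnavailable.
    Returns None if no type/zone combination has capacity.
    """
    for node_type in node_types:
        for zone in zones:
            try:
                return launch(node_type, zone)
            except CapacityUnavailable:
                # No capacity here: try the next zone, then the next type.
                continue
    return None


def fake_launch(node_type: str, zone: str) -> str:
    """Fake launcher for demonstration: the p3 type never has capacity."""
    if node_type == "ray.worker.p3.small":
        raise CapacityUnavailable(f"no capacity for {node_type} in {zone}")
    return f"{node_type}@{zone}"


result = launch_with_fallback(
    ["ray.worker.p3.small", "ray.worker.g5.base"],
    ["us-east-1a", "us-east-1b"],
    fake_launch,
)
print(result)
```

In this sketch the preferred type is listed first, so the fallback type is only used after every zone has been exhausted for the preferred one.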

hamlinkn · December 15, 2021, 9:44pm

This is a great question! I’m also interested in more information on what kind of intelligence the autoscaler has in relation to using multiple spot node types and when/how it selects them.
