I have several worker types specified under available_node_types, some of which are designated as spot instances. It appears that when there is no spot availability for the instance type I'm looking for, the autoscaler keeps retrying to acquire that node type. It looks like it's trying different availability_zones, but it never moves on to the next node type, so my cluster doesn't scale. Is there something I can do to make the autoscaler move on to the next node type? Below are the relevant portions of monitor.log:
botocore.exceptions.ClientError: An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient p3.2xlarge capacity in the Availability Zone you requested (us-east-1b). Our system will be working on provisioning additional capacity. You can currently get p3.2xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1d, us-east-1f.
2021-11-17 16:06:08,853 WARNING resource_demand_scheduler.py:730 -- The autoscaler could not find a node type to satisfy the request: [{'bundle_group_e71cadf899de8ecb8ca3cefdf62dfa97': 0.001}]. If this request is related to placement groups the resource request will resolve itself, otherwise please specify a node type with the necessary resource https://docs.ray.io/en/master/cluster/autoscaling.html#multiple-node-type-autoscaling.
2021-11-17 16:06:08,853 INFO autoscaler.py:990 -- StandardAutoscaler: Queue 1 new nodes for launch
2021-11-17 16:06:08,855 INFO node_launcher.py:99 -- NodeLauncher1: Got 1 nodes to launch.
2021-11-17 16:06:08,921 INFO node_launcher.py:99 -- NodeLauncher1: Launching 1 nodes, type ray.worker.p3.small.
2021-11-17 16:06:08,957 INFO autoscaler.py:267 --
======== Autoscaler status: 2021-11-17 16:06:08.957415 ========
Node status
---------------------------------------------------------------
Healthy:
1 ray.worker.g5.base
1 ray.head.default
Pending:
ray.worker.p3.small, 1 launching
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
2.0/8.0 CPU (2.0 used of 2.0 reserved in placement groups)
2.0/2.0 GPU (2.0 used of 2.0 reserved in placement groups)
0.0/2.0 accelerator_type:A10G
0.00/20.462 GiB memory
0.00/9.254 GiB object_store_memory
Demands:
{'CPU': 1.0, 'GPU': 1.0} * 1 (PACK): 1+ pending placement groups
2021-11-17 16:06:09,008 INFO monitor.py:328 -- :event_summary:Adding 1 nodes of type ray.worker.p3.small.
2021-11-17 16:06:13,871 ERROR node_launcher.py:92 -- Launch failed
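
To see how often each instance type and zone is being rejected, the capacity errors in monitor.log can be tallied with a short script. This is only a rough sketch: the regular expression assumes the botocore message format shown in the excerpt above, and `count_capacity_failures` is a hypothetical helper name, not part of Ray or boto.

```python
import re
from collections import Counter

# Assumption: matches the InsufficientInstanceCapacity message format
# seen in the monitor.log excerpt above; other formats are ignored.
CAPACITY_RE = re.compile(
    r"InsufficientInstanceCapacity.*?sufficient (?P<itype>\S+) capacity "
    r"in the Availability Zone you requested \((?P<zone>[^)]+)\)"
)


def count_capacity_failures(lines):
    """Count capacity errors per (instance type, availability zone)."""
    counts = Counter()
    for line in lines:
        match = CAPACITY_RE.search(line)
        if match:
            counts[(match.group("itype"), match.group("zone"))] += 1
    return counts
```

It accepts any iterable of log lines, so an open file handle for monitor.log can be passed in directly. If the same instance type shows up repeatedly across several zones, that would support the impression that the autoscaler is cycling through zones rather than node types.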