Autoscaler endless loop of scheduling failure

bnorick · February 7, 2023, 6:32pm

How severely does this issue affect your experience of using Ray?

High: It blocks me from completing my task.

I am using Ray 2.2.0. I have two questions:

1. Is there an easy way to get logs out of the autoscaler, specifically debug logs from resource_demand_scheduler? I started a cluster with ray up, attached to it, and manually edited the autoscaler's monitor.py so that --logging-level defaults to debug. That was a bit cumbersome.
2. On an AWS cluster with multiple node types, I am stuck in an endless loop where the autoscaler keeps trying and failing to launch nodes of the same type. It looks to me like NodeAvailabilityRecord.is_available isn’t being read anywhere, so it’s no surprise that the autoscaler can’t react to a node type being unavailable. Is this correct?

bnorick · February 7, 2023, 7:18pm

Digging a bit deeper, it seems like a custom utilization scorer which actually leverages the node_availability_summary would be able to downweight node types which have failed to schedule recently?

Alex · February 7, 2023, 11:35pm

It’s currently expected behavior to keep retrying the same node type until it becomes available again in the future.

bnorick · February 7, 2023, 11:50pm

So would something like this work? Yes, I know I am using a few internal bits here…

from ray.autoscaler._private import node_provider_availability_tracker
from ray.autoscaler._private import util
from ray.autoscaler._private.resource_demand_scheduler import _default_utilization_scorer


def utilization_scorer(
    node_resources: util.ResourceDict,
    resources: list[util.ResourceDict],
    node_type: str,
    *,
    node_availability_summary: node_provider_availability_tracker.NodeAvailabilitySummary,
):
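    # Skip node types whose most recent launch attempt was recorded as failed.
    # Returning None is meant to drop this node type from consideration.
    availability = node_availability_summary.node_availabilities.get(node_type)
    if availability is not None and not availability.is_available:
        return None

    # Otherwise defer to the default scoring logic.
    return _default_utilization_scorer(
        node_resources,
        resources,
        node_type,
        node_availability_summary=node_availability_summary,
    )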

bnorick · February 8, 2023, 8:42am

To answer my own question, yes this works.

If anyone else runs into the same issue and wants a resolution, the code above along with something like the following in your cluster_config.yaml should do the trick.

head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; env RAY_AUTOSCALER_UTILIZATION_SCORER=library.module.with.utilization_scorer ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0

The result, when I gave the cluster some tasks to run, looked like this:

(scheduler +7s) Failed to launch 2 node(s) of type ray_worker_1. (InsufficientInstanceCapacity): There is no Spot capacity available that matches your request.
(scheduler +12s) Failed to launch 2 node(s) of type ray_worker_2. (InsufficientInstanceCapacity): There is no Spot capacity available that matches your request.
(scheduler +17s) Failed to launch 2 node(s) of type ray_worker_3. (SpotMaxPriceTooLow): Your Spot request price of X is lower than the minimum required Spot request fulfillment price of Y.
(scheduler +22s) Adding 2 node(s) of type ray_worker_4.

Presumably the node types that failed will be tried again once the TTL on their availability status expires. This is controlled by AUTOSCALER_NODE_AVAILABILITY_MAX_STALENESS_S, which defaults to 30 minutes (please correct me if I am wrong).
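
The TTL idea can be illustrated with a minimal sketch. This is not Ray's implementation: the AvailabilityRecord class, the is_blocked helper, and the MAX_STALENESS_S constant are hypothetical names, and the 30-minute value only mirrors the default assumed above. A failure record excludes a node type only while the record is younger than the maximum staleness.

```python
import time
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for AUTOSCALER_NODE_AVAILABILITY_MAX_STALENESS_S
# (assumed to be 30 minutes, as discussed in the thread).
MAX_STALENESS_S = 30 * 60


@dataclass
class AvailabilityRecord:
    """Hypothetical record of the last launch attempt for one node type."""

    node_type: str
    is_available: bool
    last_checked_timestamp: float


def is_blocked(
    record: Optional[AvailabilityRecord],
    now: float,
    max_staleness_s: float = MAX_STALENESS_S,
) -> bool:
    """Return True if a recent failure should exclude this node type."""
    if record is None:
        # No launch attempt has been recorded, so nothing blocks the type.
        return False
    if now - record.last_checked_timestamp > max_staleness_s:
        # The record is stale, so the node type becomes eligible again.
        return False
    return not record.is_available


now = time.time()
recent_failure = AvailabilityRecord("ray_worker_1", False, now - 60)
old_failure = AvailabilityRecord("ray_worker_2", False, now - 31 * 60)

print(is_blocked(recent_failure, now))  # failure from one minute ago
print(is_blocked(old_failure, now))     # failure older than the TTL
```

With these inputs, the one-minute-old failure still blocks its node type, while the 31-minute-old failure no longer does.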

bnorick · February 8, 2023, 7:23pm

Now I’d like to get the autoscaler to attempt to schedule partial requests. That is, if I have max_workers as 5, I want it to try to schedule one at a time for a given node type rather than 5 directly, in case only some capacity is available.

If anyone has any ideas, let me know.

bnorick · February 8, 2023, 7:37pm

Looks like AUTOSCALER_MAX_LAUNCH_BATCH might be what I need.
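
To illustrate what a launch batch cap would mean for a request of 5 nodes, here is a small sketch of splitting one request into capped batches. The split_into_batches function is hypothetical, and whether Ray splits launch requests exactly this way is an assumption that has not been checked against the Ray source.

```python
from typing import List


def split_into_batches(count: int, max_batch: int) -> List[int]:
    """Split a request for `count` nodes into batches of at most `max_batch`."""
    if max_batch < 1:
        raise ValueError("max_batch must be at least 1")
    # Each batch takes up to max_batch nodes; the last batch takes the remainder.
    return [min(max_batch, count - start) for start in range(0, count, max_batch)]


# A cap of 1 turns one request for 5 nodes into five single-node requests,
# so a partial capacity shortage would only fail some of them.
print(split_into_batches(5, 1))
print(split_into_batches(5, 2))
```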

ali1 · February 11, 2025, 2:36pm

hey @bnorick, thanks for the very helpful post – really insightful.

Just curious if you or any team members have any context (or can direct me to any tickets you may be tracking) around why the default autoscaling behavior seems to not consider these items. I didn’t seem to see it on the road map or current open issues on GitHub.

I’ve also noticed that hitting vCPU limits is separate from node availability – I’ll give your AUTOSCALER_MAX_LAUNCH_BATCH a try. Also curious whether you have any suggestions for testing a custom utilization scorer other than booting up a bunch of nodes ad hoc (i.e. whether you’ve tried some kind of mocking solution).
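
One possible mocking approach, sketched under assumptions: the scorer only reads node_availability_summary.node_availabilities and each record's is_available, so small stand-in objects are enough to exercise the filtering logic without a cluster. FakeAvailability, FakeSummary, fake_default_scorer, and make_scorer are all hypothetical names introduced for this sketch. The default scorer is injected so that the example does not need to import Ray.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

ResourceDict = Dict[str, float]


@dataclass
class FakeAvailability:
    """Stand-in for a per-node-type availability record."""

    is_available: bool


@dataclass
class FakeSummary:
    """Stand-in exposing only the attribute the scorer reads."""

    node_availabilities: Dict[str, FakeAvailability] = field(default_factory=dict)


def fake_default_scorer(node_resources, resources, node_type, *, node_availability_summary):
    """Placeholder for the default scorer; always returns a fixed score."""
    return (1.0,)


def make_scorer(default_scorer: Callable) -> Callable:
    """Build the availability-aware scorer around an injectable default."""

    def utilization_scorer(
        node_resources: ResourceDict,
        resources: List[ResourceDict],
        node_type: str,
        *,
        node_availability_summary,
    ):
        availability = node_availability_summary.node_availabilities.get(node_type)
        if availability is not None and not availability.is_available:
            return None  # exclude node types recorded as unavailable
        return default_scorer(
            node_resources,
            resources,
            node_type,
            node_availability_summary=node_availability_summary,
        )

    return utilization_scorer


scorer = make_scorer(fake_default_scorer)
summary = FakeSummary(
    {
        "ray_worker_1": FakeAvailability(is_available=False),
        "ray_worker_4": FakeAvailability(is_available=True),
    }
)
node = {"CPU": 4.0}
demands = [{"CPU": 1.0}]

# Unavailable type is filtered out; available and unknown types are scored.
assert scorer(node, demands, "ray_worker_1", node_availability_summary=summary) is None
assert scorer(node, demands, "ray_worker_4", node_availability_summary=summary) == (1.0,)
assert scorer(node, demands, "ray_worker_9", node_availability_summary=summary) == (1.0,)
```

This only checks the filtering decision. It says nothing about how the real autoscaler reacts to the scores, which would still need an integration test.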
