Autoscaler endless loop of scheduling failure

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I am using Ray 2.2.0. I have two questions:

  1. Is there an easy way to get logs out of the autoscaler, specifically debug logs from resource_demand_scheduler? I started a cluster with ray up, attached to it, and manually edited the autoscaler’s monitor.py so that --logging-level defaults to debug. That was a bit cumbersome.

  2. On an AWS cluster with multiple node types I am stuck in an endless loop of trying and failing to schedule nodes of the same type. It looks to me like NodeAvailabilityRecord.is_available isn’t even being read anywhere, so it’s no surprise that the autoscaler can’t react to the node type not being available. Is this correct?

Digging a bit deeper, would a custom utilization scorer that actually uses node_availability_summary be able to downweight node types that have recently failed to launch?

It’s currently expected behavior to keep retrying the same node type until it becomes available again in the future.

So would something like this work? Yes, I know I am using a few internal bits here…

from ray.autoscaler._private import node_provider_availability_tracker
from ray.autoscaler._private import util
from ray.autoscaler._private.resource_demand_scheduler import _default_utilization_scorer


def utilization_scorer(
    node_resources: util.ResourceDict,
    resources: list[util.ResourceDict],
    node_type: str,
    *,
    node_availability_summary: node_provider_availability_tracker.NodeAvailabilitySummary,
):
    # Look up the latest availability record for this node type, if any.
    availability = node_availability_summary.node_availabilities.get(node_type)
    # Returning None excludes a node type that was recently marked unavailable.
    if availability and not availability.is_available:
        return None

    # Otherwise defer to the built-in scorer.
    return _default_utilization_scorer(
        node_resources,
        resources,
        node_type,
        node_availability_summary=node_availability_summary,
    )

To answer my own question, yes this works.
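For anyone who wants to try the skip logic without a running cluster, here is a minimal sketch that runs without Ray installed. The Fake* classes and fake_default_scorer are hypothetical stand-ins I made up for illustration; they are not Ray's real classes or its real scoring logic, and only the is_available check mirrors the code above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

ResourceDict = Dict[str, float]


@dataclass
class FakeAvailabilityRecord:
    # Hypothetical stand-in for NodeAvailabilityRecord; only is_available is modeled.
    is_available: bool


@dataclass
class FakeAvailabilitySummary:
    # Hypothetical stand-in for NodeAvailabilitySummary.
    node_availabilities: Dict[str, FakeAvailabilityRecord] = field(default_factory=dict)


def fake_default_scorer(
    node_resources: ResourceDict,
    resources: List[ResourceDict],
    node_type: str,
    *,
    node_availability_summary: FakeAvailabilitySummary,
) -> int:
    # Placeholder score (NOT Ray's scoring): the number of requests whose
    # resource names are all offered by this node type.
    return sum(1 for req in resources if all(name in node_resources for name in req))


def availability_aware_scorer(
    node_resources: ResourceDict,
    resources: List[ResourceDict],
    node_type: str,
    *,
    node_availability_summary: FakeAvailabilitySummary,
) -> Optional[int]:
    record = node_availability_summary.node_availabilities.get(node_type)
    # A type with a record marked unavailable gets no score at all.
    if record is not None and not record.is_available:
        return None
    # Types with no record, or an available record, fall through to the base scorer.
    return fake_default_scorer(
        node_resources,
        resources,
        node_type,
        node_availability_summary=node_availability_summary,
    )


summary = FakeAvailabilitySummary(
    {
        "ray_worker_1": FakeAvailabilityRecord(is_available=False),
        "ray_worker_4": FakeAvailabilityRecord(is_available=True),
    }
)
demand = [{"CPU": 1}, {"CPU": 1}]
scores = {
    node_type: availability_aware_scorer(
        {"CPU": 4}, demand, node_type, node_availability_summary=summary
    )
    for node_type in ("ray_worker_1", "ray_worker_4", "ray_worker_5")
}
print(scores)
```

Note that a node type with no availability record at all (ray_worker_5 here) is still scored normally, which matches the behavior of the scorer above: only types explicitly marked unavailable are skipped.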

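If anyone else runs into the same issue and wants a workaround, the code above, together with something like the following in your cluster_config.yaml, should do the trick. The value of RAY_AUTOSCALER_UTILIZATION_SCORER is the dotted import path to the scorer function, so the module has to be importable on the head node.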

head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; env RAY_AUTOSCALER_UTILIZATION_SCORER=library.module.with.utilization_scorer ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0

When I gave the cluster some tasks to run, the result looked like this:

(scheduler +7s) Failed to launch 2 node(s) of type ray_worker_1. (InsufficientInstanceCapacity): There is no Spot capacity available that matches your request.
(scheduler +12s) Failed to launch 2 node(s) of type ray_worker_2. (InsufficientInstanceCapacity): There is no Spot capacity available that matches your request.
(scheduler +17s) Failed to launch 2 node(s) of type ray_worker_3. (SpotMaxPriceTooLow): Your Spot request price of X is lower than the minimum required Spot request fulfillment price of Y.
(scheduler +22s) Adding 2 node(s) of type ray_worker_4.

Presumably the node types that failed will be tried again once the TTL of their availability status expires. This is controlled by AUTOSCALER_NODE_AVAILABILITY_MAX_STALENESS_S, which defaults to 30 minutes (please correct me if I am wrong).
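To make the TTL idea concrete, here is a rough sketch of how I picture the staleness check. This is my mental model, not Ray's actual implementation: FailureRecord and is_blocked are hypothetical names, and the 30 minute default is the value I assumed above.

```python
from dataclasses import dataclass

# Assumed default of 30 minutes, per the post above (not verified against Ray).
ASSUMED_MAX_STALENESS_S = 30 * 60


@dataclass
class FailureRecord:
    # Hypothetical record of a failed launch for one node type.
    node_type: str
    last_checked_s: float  # timestamp (seconds) of the failed launch attempt


def is_blocked(
    record: FailureRecord,
    now_s: float,
    max_staleness_s: float = ASSUMED_MAX_STALENESS_S,
) -> bool:
    """Return True while the failure record is still fresh enough to be trusted."""
    return (now_s - record.last_checked_s) < max_staleness_s


record = FailureRecord(node_type="ray_worker_1", last_checked_s=0.0)
blocked_after_10_min = is_blocked(record, now_s=10 * 60)
blocked_after_31_min = is_blocked(record, now_s=31 * 60)
print(blocked_after_10_min, blocked_after_31_min)
```

Under this model, a node type that failed at time 0 is skipped 10 minutes later but becomes eligible again after 31 minutes, at which point the scorer would let the autoscaler retry it.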

Now I’d like to get the autoscaler to attempt partial requests. That is, if max_workers is 5, I want it to try to launch one node at a time for a given node type rather than all 5 at once, in case only some capacity is available.

If anyone has any ideas, let me know.

Looks like AUTOSCALER_MAX_LAUNCH_BATCH might be what I need.
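To be clear about the behavior I am after, here is a toy sketch of partial requests. It does not touch Ray at all, and I have not confirmed that AUTOSCALER_MAX_LAUNCH_BATCH behaves this way; launch_in_batches and make_provider are hypothetical helpers, and the provider is assumed to fulfil each request all-or-nothing.

```python
from typing import Callable


def launch_in_batches(
    node_type: str,
    wanted: int,
    batch_size: int,
    try_launch: Callable[[str, int], bool],
) -> int:
    """Request nodes in chunks of batch_size; return how many were launched."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    launched = 0
    while launched < wanted:
        count = min(batch_size, wanted - launched)
        # Stop at the first failed batch; earlier batches stay launched.
        if not try_launch(node_type, count):
            break
        launched += count
    return launched


def make_provider(capacity: int) -> Callable[[str, int], bool]:
    # Toy provider: a request succeeds only if the whole batch fits.
    remaining = [capacity]

    def try_launch(node_type: str, count: int) -> bool:
        if count > remaining[0]:
            return False
        remaining[0] -= count
        return True

    return try_launch


# Only 3 nodes of capacity exist, but 5 are wanted.
all_at_once = launch_in_batches("ray_worker_1", 5, 5, make_provider(3))
one_by_one = launch_in_batches("ray_worker_1", 5, 1, make_provider(3))
print(all_at_once, one_by_one)
```

With all-or-nothing requests, asking for 5 at once gets nothing, while asking one at a time gets the 3 nodes that are actually available. That is the difference I am hoping a smaller launch batch would make.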