How severely does this issue affect your experience of using Ray?
High: It blocks me from completing my task.
I am using Ray 2.2.0. I have two questions:
Is there an easy way to get some logs out of the autoscaler? Specifically, debug logs from resource_demand_scheduler. I started a cluster with ray up, then attached and manually edited the autoscaler's monitor.py so that its default --logging-level is debug, which was a bit cumbersome.
On an AWS cluster with multiple node types I am stuck in an endless loop of trying and failing to schedule nodes of the same type. It looks to me like NodeAvailabilityRecord.is_available isn’t even being read anywhere, so it’s no surprise that the autoscaler can’t react to the node type not being available. Is this correct?
Digging a bit deeper, it seems like a custom utilization scorer that actually leverages the node_availability_summary would be able to downweight node types that have recently failed to schedule?
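For anyone curious, here is a rough sketch of what I mean. It is not verbatim the code I ended up with: the signature just mirrors _default_utilization_scorer in ray/autoscaler/_private/resource_demand_scheduler.py as of 2.2.0, and the node_availabilities / is_available attribute names are my reading of NodeAvailabilitySummary, so double-check them against your Ray version.

# my_utilization_scorer.py -- hypothetical module name; point
# RAY_AUTOSCALER_UTILIZATION_SCORER at "<module.path>.utilization_scorer".
from typing import Dict, List, Optional, Tuple

from ray.autoscaler._private.resource_demand_scheduler import (
    _default_utilization_scorer,
)


def utilization_scorer(
    node_resources: Dict[str, float],
    resources: List[Dict[str, float]],
    node_type: str,
    *,
    node_availability_summary,
) -> Optional[Tuple]:
    # If the availability tracker has recorded a recent launch failure for this
    # node type, return None, which (as far as I can tell) excludes the node
    # type from consideration in the current scheduling round.
    record = node_availability_summary.node_availabilities.get(node_type)
    if record is not None and not record.is_available:
        return None
    # Otherwise fall back to the stock scoring behavior.
    return _default_utilization_scorer(
        node_resources,
        resources,
        node_type,
        node_availability_summary=node_availability_summary,
    )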
If anyone else runs into the same issue and wants a resolution, a scorer along those lines together with something like the following in your cluster_config.yaml should do the trick.
head_start_ray_commands:
- ray stop
- ulimit -n 65536; env RAY_AUTOSCALER_UTILIZATION_SCORER=library.module.with.utilization_scorer ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0
When I gave the cluster some tasks to run, the result looked like:
(scheduler +7s) Failed to launch 2 node(s) of type ray_worker_1. (InsufficientInstanceCapacity): There is no Spot capacity available that matches your request.
(scheduler +12s) Failed to launch 2 node(s) of type ray_worker_2. (InsufficientInstanceCapacity): There is no Spot capacity available that matches your request.
(scheduler +17s) Failed to launch 2 node(s) of type ray_worker_3. (SpotMaxPriceTooLow): Your Spot request price of X is lower than the minimum required Spot request fulfillment price of Y.
(scheduler +22s) Adding 2 node(s) of type ray_worker_4.
Presumably the node types which failed will be tried again after the TTL on their availability status expires. This is controlled by AUTOSCALER_NODE_AVAILABILITY_MAX_STALENESS_S, which defaults to 30 minutes (please correct me if I am wrong).
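If you want to apply that same TTL inside a custom scorer rather than relying on whatever pruning the tracker does, a helper along these lines might work. Treat it as a sketch: last_checked_timestamp is my guess at the record's timestamp field, and the constant's import path is from my reading of the 2.2.0 source.

import time

# Assumed import path for the staleness constant (default 30 * 60 seconds).
from ray.autoscaler._private.constants import (
    AUTOSCALER_NODE_AVAILABILITY_MAX_STALENESS_S,
)


def recently_unavailable(record) -> bool:
    # True if the NodeAvailabilityRecord reports a launch failure that is
    # fresher than the staleness window, i.e. the node type should still be
    # avoided; stale failures are treated as worth retrying.
    if record is None or record.is_available:
        return False
    age_s = time.time() - record.last_checked_timestamp
    return age_s < AUTOSCALER_NODE_AVAILABILITY_MAX_STALENESS_S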
Now I’d like to get the autoscaler to attempt partial requests. That is, if I have max_workers set to 5, I want it to try launching one node at a time for a given node type rather than all 5 at once, in case only partial capacity is available.
hey @bnorick, thanks for the very helpful post – really insightful.
Just curious whether you or any team members have context (or can point me to any tickets you may be tracking) on why the default autoscaling behavior doesn’t seem to take these failures into account. I didn’t see it on the roadmap or among the current open issues on GitHub.
I’ve also noticed that hitting vCPU limits is separate from node availability – I’ll give your AUTOSCALER_MAX_LAUNCH_BATCH suggestion a try. Also curious whether you have any suggestions for testing a custom utilization scorer besides ad hoc booting up a bunch of nodes (i.e. whether you’ve tried some kind of mocking solution).
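To make the mocking question concrete, I was imagining calling the scorer directly with a hand-built availability summary, something like the sketch below. The NodeAvailabilityRecord / UnavailableNodeInformation constructors and field names are my guesses from node_provider_availability_tracker.py, and my_scorer_module is a placeholder, so please correct me if the shapes are off.

import time

# Assumed classes and field names from
# ray/autoscaler/_private/node_provider_availability_tracker.py.
from ray.autoscaler._private.node_provider_availability_tracker import (
    NodeAvailabilityRecord,
    NodeAvailabilitySummary,
    UnavailableNodeInformation,
)

from my_scorer_module import utilization_scorer  # placeholder for the custom scorer


def test_unavailable_node_type_is_skipped():
    # Build a summary that claims ray_worker_1 just failed to launch on spot.
    summary = NodeAvailabilitySummary(
        node_availabilities={
            "ray_worker_1": NodeAvailabilityRecord(
                node_type="ray_worker_1",
                is_available=False,
                last_checked_timestamp=time.time(),
                unavailable_node_information=UnavailableNodeInformation(
                    category="InsufficientInstanceCapacity",
                    description="There is no Spot capacity available that matches your request.",
                ),
            )
        }
    )
    score = utilization_scorer(
        {"CPU": 4},    # resources of one ray_worker_1 node
        [{"CPU": 1}],  # pending resource demands
        "ray_worker_1",
        node_availability_summary=summary,
    )
    assert score is None, "unavailable node type should be excluded from scoring"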