How severe does this issue affect your experience of using Ray?
High: It blocks me from completing my task.
I am using Ray 2.2.0. I have two questions:
Is there an easy way to get more logs out of the autoscaler, specifically debug logs from resource_demand_scheduler? I started a cluster with ray up, then attached to the head node and manually edited the autoscaler's monitor.py so that its default --logging-level is debug, which was a bit cumbersome.
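(For anyone with the same question, a less cumbersome workaround I'm considering is sketched below: attach a DEBUG-level handler to the scheduler's logger from a module that the monitor process already imports, e.g. the custom scorer module described further down. The logger name and log file path here are assumptions on my part, based on Ray naming loggers after their module paths; treat this as a sketch rather than a supported knob.)

# Hedged sketch: run at import time of a module the autoscaler/monitor process loads
# (for example the custom utilization scorer module referenced below).
# The logger name is an assumption based on Ray naming loggers after module paths;
# the log file path is arbitrary.
import logging

_sched_logger = logging.getLogger("ray.autoscaler._private.resource_demand_scheduler")
_sched_logger.setLevel(logging.DEBUG)

_handler = logging.FileHandler("/tmp/resource_demand_scheduler_debug.log")
_handler.setLevel(logging.DEBUG)
_handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s"))
_sched_logger.addHandler(_handler)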
On an AWS cluster with multiple node types, I am stuck in an endless loop of trying and failing to schedule nodes of the same type. As far as I can tell, NodeAvailabilityRecord.is_available is never actually read anywhere, so it is no surprise that the autoscaler cannot react to a node type being unavailable. Is this correct?
Digging a bit deeper, it seems like a custom utilization scorer that actually leverages the node_availability_summary would be able to down-weight node types that have recently failed to schedule. Roughly along the lines of the sketch below.
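The exact signature (node_resources, resources, node_type, plus a keyword-only node_availability_summary), the _default_utilization_scorer import, and the node_availabilities / last_checked_timestamp field names are what I inferred from the Ray 2.2 internals, so treat them as assumptions rather than a documented API; returning None appears to make the scheduler skip a node type, mirroring the default scorer.

# Hypothetical module (the path passed via RAY_AUTOSCALER_UTILIZATION_SCORER below)
# exposing `utilization_scorer`.
# Assumptions: the scorer is called with the same arguments as Ray's internal
# _default_utilization_scorer, and returning None makes the scheduler skip the type.
import logging
import time
from typing import Dict, List

from ray.autoscaler._private.resource_demand_scheduler import (
    _default_utilization_scorer,  # assumed internal import path (Ray 2.2)
)

logger = logging.getLogger(__name__)

# Skip a node type for this long after it was last reported unavailable.
UNAVAILABLE_BACKOFF_S = 10 * 60


def utilization_scorer(
    node_resources: Dict[str, float],
    resources: List[Dict[str, float]],
    node_type: str,
    *,
    node_availability_summary,
):
    # node_availabilities / last_checked_timestamp are assumed field names on
    # NodeAvailabilitySummary / NodeAvailabilityRecord.
    record = node_availability_summary.node_availabilities.get(node_type)
    if (
        record is not None
        and not record.is_available
        and time.time() - record.last_checked_timestamp < UNAVAILABLE_BACKOFF_S
    ):
        logger.debug("Skipping node type %s: recently reported unavailable.", node_type)
        return None
    # Otherwise fall back to the default resource-based score.
    return _default_utilization_scorer(
        node_resources,
        resources,
        node_type,
        node_availability_summary=node_availability_summary,
    )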
If anyone else runs into the same issue and wants a resolution, the code above along with something like the following in your cluster_config.yaml should do the trick.
head_start_ray_commands:
- ray stop
- ulimit -n 65536; env RAY_AUTOSCALER_UTILIZATION_SCORER=library.module.with.utilization_scorer ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0
The result, when I gave the cluster some tasks to run, looked like this:
(scheduler +7s) Failed to launch 2 node(s) of type ray_worker_1. (InsufficientInstanceCapacity): There is no Spot capacity available that matches your request.
(scheduler +12s) Failed to launch 2 node(s) of type ray_worker_2. (InsufficientInstanceCapacity): There is no Spot capacity available that matches your request.
(scheduler +17s) Failed to launch 2 node(s) of type ray_worker_3. (SpotMaxPriceTooLow): Your Spot request price of X is lower than the minimum required Spot request fulfillment price of Y.
(scheduler +22s) Adding 2 node(s) of type ray_worker_4.
Presumably the node types that failed will be tried again once the TTL on their unavailability status expires. This is controlled by AUTOSCALER_NODE_AVAILABILITY_MAX_STALENESS_S, which defaults to 30 minutes (please correct me if I am wrong).
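If 30 minutes turns out to be too long, it looks like that constant could be lowered the same way the scorer is injected, assuming it is read from the environment like other autoscaler constants (I have not verified this), e.g. by extending the head_start_ray_commands entry above:

- ulimit -n 65536; env AUTOSCALER_NODE_AVAILABILITY_MAX_STALENESS_S=300 RAY_AUTOSCALER_UTILIZATION_SCORER=library.module.with.utilization_scorer ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0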
Now I’d like to get the autoscaler to attempt partial requests. That is, with max_workers set to 5, I want it to try launching nodes of a given type one at a time rather than all 5 at once, in case only some capacity is available.