How to implement ad-hoc spot instance scaling?

Vedant_Roy · October 24, 2022, 6:01pm

How severe does this issue affect your experience of using Ray?

Medium: It contributes to significant difficulty to complete my task, but I can work around it.
- I’m not sure how to implement spot instance training w/o this

The current trainer API seems to work as follows: if the number of spot instances is less than the number of requested workers, it waits until more spot instances come online and then resumes training.

This seems fine, but an alternative model would be to keep training with fewer spot instances until new ones come online. This could be useful if, due to high demand, the number of available spot instances decreases semi-permanently.

I am trying to figure out how to implement this using Ray AIR. The first step would be to detect when the cluster size changes. How could I go about doing this?
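
One way to sketch the detection step, assuming the cluster state is polled with `ray.nodes()` (which returns a list of per-node dicts that include `NodeID` and `Alive`), is to compare two snapshots and report which nodes joined or left. The helper names `alive_node_ids` and `diff_nodes` are hypothetical, and the snapshots are passed in as plain data so the logic can be tried without a running cluster.

```python
from typing import Dict, List, Set, Tuple


def alive_node_ids(nodes: List[Dict]) -> Set[str]:
    """Collect IDs of nodes marked alive in a ray.nodes()-style snapshot."""
    return {n["NodeID"] for n in nodes if n.get("Alive")}


def diff_nodes(previous: List[Dict], current: List[Dict]) -> Tuple[Set[str], Set[str]]:
    """Return (joined, left) node IDs between two snapshots."""
    before, after = alive_node_ids(previous), alive_node_ids(current)
    return after - before, before - after


# Hand-written snapshots standing in for two successive ray.nodes() calls.
snap_a = [{"NodeID": "a", "Alive": True}, {"NodeID": "b", "Alive": True}]
snap_b = [{"NodeID": "a", "Alive": True}, {"NodeID": "b", "Alive": False},
          {"NodeID": "c", "Alive": True}]
joined, left = diff_nodes(snap_a, snap_b)
print(joined, left)
```

In a live cluster, the snapshots would presumably come from calling `ray.nodes()` on a timer after `ray.init()`.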

kai · October 27, 2022, 3:54pm

Hi @Vedant_Roy, we’re currently scoping elastic training internally. That said, it will take some time before it is fully implemented and available.

Monitoring the cluster size can be misleading, as some cluster resources might be taken up by other tasks. For example, if you run hyperparameter optimization, a dying node might only affect one trial and not the others. For a general solution, the approach is probably to detect actor failures (when a node goes down), schedule the actor again, and detect when it becomes available (when either there is enough space in the cluster or another node comes up).
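
To illustrate the "enough space in the cluster" check, here is a minimal sketch that computes how many workers fit into the currently free resources. It assumes the free resources arrive as a dict in the shape returned by `ray.available_resources()` (for example `{"CPU": 8.0, "GPU": 2.0}`); the function name `workers_that_fit` and the `min_workers` threshold are hypothetical and not part of Ray's API.

```python
import math
from typing import Dict


def workers_that_fit(available: Dict[str, float], per_worker: Dict[str, float],
                     max_workers: int, min_workers: int = 1) -> int:
    """Return how many workers can be scheduled, or 0 if below min_workers."""
    fits = max_workers
    for resource, amount in per_worker.items():
        if amount <= 0:
            continue  # a worker needing none of this resource is not limited by it
        fits = min(fits, math.floor(available.get(resource, 0.0) / amount))
    return fits if fits >= min_workers else 0


# Example: 6 free CPUs and 3 free GPUs, each worker needs 2 CPUs and 1 GPU.
free = {"CPU": 6.0, "GPU": 3.0}
print(workers_that_fit(free, {"CPU": 2, "GPU": 1}, max_workers=4))
```

Because this looks at free resources rather than node count, it accounts for resources already claimed by other tasks or trials.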

svaruag · January 11, 2023, 9:52pm

Hi @kai - Any idea what the current timeline looks like for this? Is there a GitHub issue we can track (for design discussions etc.)? Or is there scope to collaborate or contribute on this?
We’re looking to have elastic training (like torch elastic) on a Ray cluster hosted on Azure Computes.

kai · February 15, 2023, 6:04pm

Hi @svaruag @Vedant_Roy

We’ve deprioritized this for the time being, but let’s use this GitHub issue to drive the discussion: [rfc][Train] Support for elastic training / Discussion and requirements · Issue #20647 · ray-project/ray · GitHub

I think at first we should nail down the requirements and then see how we can support them best.
