How to implement ad-hoc spot instance scaling?

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty in completing my task, but I can work around it.
    • I’m not sure how to implement spot instance training without this.

The current trainer API seems to work as follows: if the number of available spot instances is less than the number of requested workers, training waits until more spot instances come online and then resumes.

This seems fine, but an alternative model would be to keep training with fewer spot instances until a new spot instance comes online. This could be useful if, due to high demand, the number of available spot instances decreases semi-permanently.

I am trying to figure out how to implement this using Ray AIR. The first step would be to detect when the cluster size changes. How could I go about doing this?
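As a starting point for the detection step, here is a minimal sketch that polls `ray.nodes()` and reports which nodes appeared or disappeared between polls. The helper names (`alive_node_ids`, `diff_nodes`, `watch_cluster`), the polling interval, and the callback shape are my own assumptions for illustration, not part of Ray AIR. It also only sees node-level changes, not which workers were affected.

```python
import time
from typing import Callable, Iterable, Set, Tuple


def alive_node_ids(nodes: Iterable[dict]) -> Set[str]:
    """Return the IDs of nodes reported as alive in a ray.nodes() snapshot."""
    return {n["NodeID"] for n in nodes if n.get("Alive")}


def diff_nodes(previous: Set[str], current: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (added, removed) node IDs between two snapshots."""
    return current - previous, previous - current


def watch_cluster(
    on_change: Callable[[Set[str], Set[str]], None],
    poll_interval_s: float = 5.0,  # hypothetical default, tune as needed
) -> None:
    """Poll the cluster forever and call on_change(added, removed) on changes."""
    import ray  # imported lazily so the helpers above work without Ray

    # Assumes this runs on a machine that can reach an existing cluster.
    ray.init(address="auto", ignore_reinit_error=True)
    known = alive_node_ids(ray.nodes())
    while True:
        time.sleep(poll_interval_s)
        current = alive_node_ids(ray.nodes())
        added, removed = diff_nodes(known, current)
        if added or removed:
            on_change(added, removed)
        known = current
```

For example, `watch_cluster(lambda added, removed: print(added, removed))` would print the node IDs that changed on each poll.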

Hi @Vedant_Roy, we’re scoping implementing elastic training internally right now - that said it will take some time to be fully implemented and available.

Monitoring the cluster size can be misleading, as some cluster resources might be taken up by other tasks - e.g. if you run hyperparameter optimization, a dying node might only affect one trial and not the others. For a general solution, the approach is probably to detect actor failures (when a node goes down), schedule the actor again, and detect when it becomes available (when either there is enough space in the cluster or another node comes up).
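To make that approach a bit more concrete, here is a rough sketch of the bookkeeping: failed workers are dropped from the active set, a replacement is requested right away, and it is promoted to active once it responds. `ElasticPool`, `ray_hooks`, `min_size`, and the `ping` method on the actor are all hypothetical names I made up for illustration; this is not how Ray Train works internally. The caller is assumed to catch `ray.exceptions.RayActorError` around `ray.get` on a training step and then call `mark_failed`.

```python
from typing import Any, Callable, List


class ElasticPool:
    """Tracks active workers and replacements that are still being scheduled."""

    def __init__(self, target_size: int, spawn: Callable[[], Any],
                 is_ready: Callable[[Any], bool], min_size: int = 1):
        self.spawn = spawn
        self.is_ready = is_ready
        self.min_size = min_size
        self.active: List[Any] = []
        # Every worker starts as pending until it answers a readiness check.
        self.pending: List[Any] = [spawn() for _ in range(target_size)]

    def mark_failed(self, worker: Any) -> None:
        """Drop a dead worker and immediately request a replacement."""
        self.active.remove(worker)
        self.pending.append(self.spawn())

    def refresh(self) -> int:
        """Promote pending workers that are ready; return the active count."""
        still_pending = []
        for w in self.pending:
            (self.active if self.is_ready(w) else still_pending).append(w)
        self.pending = still_pending
        return len(self.active)

    def can_train(self) -> bool:
        """True if enough workers are up to keep training at reduced size."""
        return len(self.active) >= self.min_size


def ray_hooks(actor_cls):
    """Build spawn/is_ready hooks for a Ray actor class.

    Assumes actor_cls is decorated with @ray.remote and defines a trivial
    `ping` method (hypothetical) that returns once the actor is scheduled.
    """
    import ray  # imported lazily so ElasticPool is usable without Ray

    def spawn():
        handle = actor_cls.remote()
        return handle, handle.ping.remote()

    def is_ready(worker) -> bool:
        _, ping_ref = worker
        # timeout=0 makes this a non-blocking check. Note that a ref also
        # counts as ready if the actor died, so real code should ray.get it
        # and handle RayActorError.
        ready, _ = ray.wait([ping_ref], timeout=0)
        return bool(ready)

    return spawn, is_ready
```

With this shape, a training loop could call `refresh()` between epochs and continue with whatever is in `pool.active` as long as `can_train()` holds, which would also cover the case where a replacement never arrives.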

Hi @kai - Any idea what the current timeline looks like for this? Is there a GitHub issue we can track (for design discussions etc.)? And is there scope to collaborate / contribute on this?
We’re looking to have elastic training (like torch elastic) on a Ray cluster hosted on Azure Computes.

Hi @svaruag @Vedant_Roy

We’ve deprioritized this for the time being, but let’s use this GitHub issue to drive the discussion: [rfc][Train] Support for elastic training / Discussion and requirements · Issue #20647 · ray-project/ray · GitHub

I think at first we should nail down the requirements and then see how we can support them best.