How to implement ad-hoc spot instance scaling?

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes significant difficulty to completing my task, but I can work around it.
    • I’m not sure how to implement spot instance training without this.

The current Trainer API seems to work as follows: if the number of available spot instances is less than the number of requested workers, training waits until more spot instances come online before resuming.

This seems fine, but an alternative model would be to train with fewer spot instances until a new spot instance comes online. This could be useful if, due to high demand, the number of available spot instances decreases semi-permanently.

I am trying to figure out how to implement this using Ray AIR. The first step would be to detect when the cluster size changes. How could I go about doing this?
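One way to detect cluster size changes is to poll `ray.cluster_resources()` periodically and compare snapshots. The sketch below is a minimal illustration, not a confirmed Ray AIR mechanism: the comparison logic is kept as pure functions (`detect_change` and `watch_cluster` are hypothetical helpers) so it runs without a live cluster; in practice you would pass `ray.cluster_resources` as the `get_resources` callable after `ray.init()`.

```python
import time


def cpu_count(resources):
    """Extract the total CPU count from a Ray-style resource dict."""
    return resources.get("CPU", 0.0)


def detect_change(prev, curr):
    """Return the CPU delta between two resource snapshots,
    or None if nothing changed. Hypothetical helper for illustration."""
    delta = cpu_count(curr) - cpu_count(prev)
    return delta if delta != 0 else None


def watch_cluster(get_resources, interval_s=5.0, iterations=None):
    """Poll get_resources() (e.g. ray.cluster_resources) and yield CPU deltas.

    `iterations` bounds the loop for testing; in production you would
    run this forever in a background thread or task.
    """
    prev = get_resources()
    i = 0
    while iterations is None or i < iterations:
        time.sleep(interval_s)
        curr = get_resources()
        delta = detect_change(prev, curr)
        if delta is not None:
            yield delta
        prev = curr
        i += 1
```

A consumer of `watch_cluster` could then decide, on a negative delta, to continue training with the remaining workers rather than block.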

Hi @Vedant_Roy, we’re scoping out elastic training internally right now; that said, it will take some time to be fully implemented and available.

Monitoring the cluster size can be misleading, as some cluster resources might be taken up by other tasks. For example, if you run hyperparameter optimization, a dying node might only affect one trial and not the others. For a general solution, the approach is probably to detect actor failures (when a node goes down), reschedule the actor, and detect when it becomes available again (when either there is enough space in the cluster or another node comes up).
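The detect-reschedule-reattach pattern described above can be sketched as a small supervisor loop. This is a generic illustration, not Ray's actual internal implementation: `ActorUnavailableError` stands in for `ray.exceptions.RayActorError`, and `make_actor` / `call` are hypothetical placeholders for an actor factory (e.g. `MyWorker.remote()`) and a blocking remote call.

```python
import time


class ActorUnavailableError(Exception):
    """Stand-in for ray.exceptions.RayActorError in this sketch."""


def supervise(make_actor, call, max_restarts=3, backoff_s=1.0):
    """Run `call(actor)`; on actor failure, recreate the actor and retry.

    make_actor: factory that (re)schedules the actor.
    call: function that performs the remote call and blocks on the result.
    """
    actor = make_actor()
    restarts = 0
    while True:
        try:
            return call(actor)
        except ActorUnavailableError:
            restarts += 1
            if restarts > max_restarts:
                raise  # give up after repeated failures
            time.sleep(backoff_s)  # wait for capacity or a new node
            actor = make_actor()   # rescheduling implicitly waits for resources
```

With real Ray actors, the resubmitted actor task would simply stay pending until the cluster has enough resources, which is exactly the "detect when it becomes available" step.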