How severely does this issue affect your experience of using Ray?
- Medium: It contributes to significant difficulty to complete my task, but I can work around it.
- I’m not sure how to implement spot instance training without this.
The current trainer API seems to work as follows: if the number of available spot instances is less than the number of requested workers, training waits until more spot instances come online and then resumes.
This seems fine, but an alternative model would be to keep training with fewer spot instances until new ones come online. This could be useful if, due to high demand, the number of available spot instances decreases semi-permanently.
I am trying to figure out how to implement this using Ray AIR. The first step would be to detect when the cluster size changes. How could I go about doing this?
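To make the question concrete, here is a minimal sketch of one possible way to detect cluster size changes, assuming that polling `ray.nodes()` from a process already connected to the cluster is acceptable. The helper names (`alive_node_ids`, `diff_nodes`, `watch_cluster`) and the `on_change` callback are hypothetical and not part of Ray. The only Ray call used is `ray.nodes()`, which returns one dict per node with `"NodeID"` and `"Alive"` keys. This only detects the change; it does not resize a running trainer.

```python
import time
from typing import Callable, Iterable, List, Optional, Set, Tuple


def alive_node_ids(nodes: Iterable[dict]) -> Set[str]:
    """Return the IDs of nodes reported as alive.

    Expects entries shaped like those returned by ray.nodes().
    """
    return {n["NodeID"] for n in nodes if n.get("Alive")}


def diff_nodes(previous: Set[str], current: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (added, removed) node IDs between two snapshots."""
    return current - previous, previous - current


def watch_cluster(
    on_change: Callable[[Set[str], Set[str], int], None],
    poll_interval_s: float = 5.0,
    max_polls: Optional[int] = None,
    list_nodes: Optional[Callable[[], List[dict]]] = None,
) -> None:
    """Poll the node list and call on_change(added, removed, alive_count).

    list_nodes defaults to ray.nodes; it is a parameter (an assumption of
    this sketch) so the loop can be exercised without a live cluster.
    """
    if list_nodes is None:
        import ray  # imported lazily so the helpers above work without Ray

        list_nodes = ray.nodes

    previous = alive_node_ids(list_nodes())
    polls = 0
    while max_polls is None or polls < max_polls:
        time.sleep(poll_interval_s)
        current = alive_node_ids(list_nodes())
        added, removed = diff_nodes(previous, current)
        if added or removed:
            on_change(added, removed, len(current))
        previous = current
        polls += 1
```

With a live cluster, this would be called after `ray.init(address="auto")`, for example `watch_cluster(lambda added, removed, n: print(f"+{len(added)} -{len(removed)} alive={n}"))`. Polling is only one option; whether Ray exposes a push-style notification for node changes is something I have not confirmed.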