Hello, I am trying to understand whether, and how, Ray works with transient hardware such as AWS Spot Instances. I plan to run Ray Serve with replicas placed on transient machines. I saw that the autoscaler has options for specifying AWS hardware. Can I specify that the hardware the replicas run on is transient, and have the autoscaler handle fault tolerance by redeploying replicas if these instances are revoked?
Hey @jonahrosenblum, the Ray autoscaler and Ray Serve should both handle this case gracefully. You can use spot instances for workers with the autoscaler, and if they are brought down, the autoscaler will automatically spin up new instances. Ray Serve will also re-place the replicas in this case.
The one caveat here is that you need to make sure the head node is not a spot instance; if that machine is shut down, the whole cluster goes down.
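For illustration, here is a rough sketch of that split: an on-demand head node and spot workers. It builds the cluster config as a Python dict so it is easy to check. The key names follow the Ray cluster config format for AWS as I understand it (`available_node_types`, `head_node_type`, and `InstanceMarketOptions` inside `node_config`), so double-check them against the docs for your Ray version. The cluster name, region, instance types, and node type names are placeholders I made up.

```python
import json


def build_cluster_config(max_workers: int = 4) -> dict:
    """Return a cluster config dict with an on-demand head and spot workers.

    All concrete values (names, region, instance types) are hypothetical
    placeholders; replace them with your own.
    """
    return {
        "cluster_name": "serve-spot-demo",  # placeholder name
        "max_workers": max_workers,
        "provider": {"type": "aws", "region": "us-west-2"},  # placeholder region
        "available_node_types": {
            # Head node: no InstanceMarketOptions, so it is a regular
            # on-demand instance and cannot be revoked like a spot instance.
            "head_on_demand": {
                "node_config": {"InstanceType": "m5.large"},
            },
            # Workers: request spot capacity. If one is revoked, the
            # autoscaler is expected to launch a replacement.
            "worker_spot": {
                "node_config": {
                    "InstanceType": "m5.large",
                    "InstanceMarketOptions": {"MarketType": "spot"},
                },
                "min_workers": 0,
                "max_workers": max_workers,
            },
        },
        "head_node_type": "head_on_demand",
    }


if __name__ == "__main__":
    # Print the config so it can be transcribed into the cluster YAML file.
    print(json.dumps(build_cluster_config(), indent=2))
```

The point of the layout is only that the spot request lives on the worker node type and is absent from the head node type; everything else would need to be filled in for a real cluster.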
Thank you! This lines up well with my understanding based on reading the source code. I have one more question if you don’t mind. Most cloud providers give a 2-minute (or sometimes a 30-second) heads-up when a spot instance is about to be revoked. Is there currently a way to use this warning signal to have Ray’s autoscaler start spinning up a replacement before the instance is actually revoked?