[Autoscaler] [Clusters] Understanding Ray Autoscaler with Transient Hardware

jonahrosenblum · April 4, 2021, 1:52am

Hello, I am trying to understand how (or whether) Ray works with transient hardware, i.e. AWS Spot Instances. I currently plan to run Ray Serve with replicas placed on transient machines. I saw that the autoscaler has options for specifying AWS hardware. Can I specify that the machines the replicas run on are transient, and will the autoscaler handle fault tolerance and redeploy replicas if those instances are revoked?

eoakes · April 5, 2021, 4:13pm

Hey @jonahrosenblum, the Ray autoscaler and Ray Serve should both handle this case gracefully. You can use spot instances for workers with the autoscaler, and if they are brought down, the autoscaler will automatically spin up new instances. Ray Serve will also re-place the replicas in this case.

The one caveat here is you need to make sure that the head node is not a spot instance; if that machine is shut down, the whole cluster goes down.
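
To make the head-node caveat concrete, here is a minimal sketch that builds an autoscaler-style cluster config as a Python dict and checks that the head node type is not a spot instance. The key names (`available_node_types`, `head_node_type`, `node_config`, `InstanceMarketOptions`) follow the layout of Ray's AWS example config as I understand it and may differ between Ray versions; the cluster name, node type names, region, instance types, and the helper functions are all hypothetical.

```python
def build_cluster_config(max_workers=4):
    """Return a cluster config with an on-demand head and spot workers.

    Assumption: key names mirror Ray's AWS example YAML; verify them
    against the Ray version you are running.
    """
    return {
        "cluster_name": "serve-spot-example",  # hypothetical name
        "max_workers": max_workers,
        "provider": {"type": "aws", "region": "us-west-2"},
        "head_node_type": "head_on_demand",
        "available_node_types": {
            # Head node: no InstanceMarketOptions, so it is on-demand.
            "head_on_demand": {
                "node_config": {"InstanceType": "m5.large"},
                "max_workers": 0,
            },
            # Workers: requested as spot instances.
            "worker_spot": {
                "node_config": {
                    "InstanceType": "m5.large",
                    "InstanceMarketOptions": {"MarketType": "spot"},
                },
                "min_workers": 0,
                "max_workers": max_workers,
            },
        },
    }


def is_spot(node_type):
    """True if this node type asks AWS for a spot instance."""
    market = node_type["node_config"].get("InstanceMarketOptions", {})
    return market.get("MarketType") == "spot"


def check_head_is_on_demand(config):
    """Raise ValueError if the head node type is a spot instance."""
    head = config["available_node_types"][config["head_node_type"]]
    if is_spot(head):
        raise ValueError("head node must not be a spot instance")
    return True
```

The same structure can be written directly in the cluster YAML; the check is only a guard against accidentally copying the spot settings onto the head node type.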

jonahrosenblum · April 5, 2021, 4:41pm

Thank you! This lines up well with my understanding based on reading the source code. I have one more question, if you don’t mind. Most cloud providers give a 2-minute (or sometimes a 30-second) heads-up before a spot instance is revoked. Is there currently a way to use this warning signal so that Ray’s autoscaler starts spinning up a replacement ahead of time?
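
For context on the warning signal itself: the thread does not say whether the autoscaler can consume it. As a rough sketch of what the signal looks like on AWS, a spot instance can poll its instance metadata endpoint, which returns 404 until an interruption is scheduled and then a small JSON document with an `action` and a `time`. The helper names below are hypothetical, nothing here calls into Ray, and instances that enforce IMDSv2 would also need a session token, which this sketch omits.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone

# AWS instance metadata path for spot interruption notices.
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def fetch_notice(url=SPOT_ACTION_URL, timeout=1.0):
    """Return the raw notice body, or None if no interruption is scheduled.

    Only works when run on the instance itself. Assumes IMDSv1-style
    access (no session token).
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption notice yet
            return None
        raise


def seconds_until_interruption(body, now):
    """Parse a notice body; return (action, seconds remaining before it)."""
    notice = json.loads(body)
    when = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    when = when.replace(tzinfo=timezone.utc)
    return notice["action"], (when - now).total_seconds()
```

A process on the worker could call `fetch_notice()` every few seconds and, once it returns a body, use the remaining time to drain work or request a replacement through whatever mechanism the cluster provides.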
