Ray Serve Autoscaling: Autoscaling backend-replicas removed?

Hi,
I’m very new to model serving (and web things in general), so please excuse any inaccuracies.

I want to serve a model that requires a GPU using Ray Serve, and I want autoscaling to control the number of running replicas and nodes based on the number of requests the model is receiving. When there are no requests, I would also like to shut down all nodes, since GPU instances are quite expensive.
I found references to an experimental autoscaler for Serve in the docs, and I also found the pull request that added it in the source files.

I have two specific questions:

  • It seems that it used to be possible to simply pass an autoscaling config via the backend config, but the references to it in the Serve controller seem to have disappeared from the current master branch. Was this feature removed entirely, or is there a different alternative now?

  • Does adjusting the number of replicas (automatically via the autoscaler) also adjust the number of actors/nodes? I guess that part would be handled separately by the Ray cluster autoscaler?

cc @simon-mo Can you address his question?

Hi @JohannesAck, thanks for the question. We are revamping our controller architecture right now, so the autoscaling configuration has been removed. Additionally, we are still designing a widely applicable autoscaling algorithm. If you can share your use case, it will help our design!

As for scaling nodes, here’s how it will work (rough sketch after the list):

  • Ray Serve scales the number of replica actors.
  • Ray core / the Ray autoscaler notices that resource demand > available resources and scales up more nodes.
  • The new nodes join the Ray cluster.
  • Ray Serve’s actors get placed on them and become ready.
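
To make the resource-demand part concrete, here is a rough sketch (using the deployment-style options purely for illustration; exact option names depend on the Ray version you are on). The important bit is that each replica declares it needs a GPU, so scaling replicas up creates pending GPU demand for the cluster autoscaler:

```python
import ray
from ray import serve

ray.init(address="auto")  # connect to the running cluster
serve.start()

# Each replica requests one whole GPU. When Serve adds replicas, the extra
# replica actors show up as pending {"GPU": 1} demand, which is what the
# cluster autoscaler reacts to by launching new GPU nodes.
@serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 1})
class ImageGenerator:
    def __call__(self, request):
        # expensive GPU work goes here
        return "generated image"

ImageGenerator.deploy()
```

Scaling the deployment back down frees the GPUs, and the cluster autoscaler removes the now-idle nodes after its idle timeout.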

Hi Simon, thanks for the answer.

I’m not sure how representative my use case is, but I’m not just serving a single output of a TF/Torch model; rather, I’m running a search over the inputs of a model. The goal is text-to-image generation, similar to Big Sleep.
The main challenge is that this takes a lot of compute per request: generating an image completely occupies a single GPU instance for up to a minute, so for each request I ideally want to start a new GPU instance. I would then also like to shut these instances down again quickly to save cost.

Essentially, each request is a very expensive computation that needs a single GPU to itself, takes maybe a minute or so to run, and arrives at a rate that can vary wildly.
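
To make that concrete, here is roughly how I think of the workload in plain Ray terms (just a sketch; generate_image is a hypothetical stand-in for the actual Big Sleep-style search loop):

```python
import ray

ray.init(address="auto")

# Each call occupies one whole GPU for roughly a minute. Queued calls show
# up as pending {"GPU": 1} demand, which should make the cluster autoscaler
# add nodes, and remove them again once the queue drains.
@ray.remote(num_gpus=1)
def generate_image(prompt: str) -> bytes:
    # placeholder for the Big Sleep-style optimization over model inputs
    return b"generated image bytes"

futures = [generate_image.remote(p) for p in ["a red fox", "a castle at night"]]
images = ray.get(futures)
```

What I’d want Serve autoscaling to do is essentially the same thing, just driven by incoming HTTP requests instead of a Python list.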

Currently I’m also still struggling to keep the start-up time sufficiently low. I guess the solution might be to use a machine image instead of Docker, although that doesn’t seem to be specifically supported by Ray; I think it should be relatively straightforward to do on my own, though.
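
One thing I might try in the meantime (just an idea, and I’m assuming the autoscaler SDK’s request_resources is the right hook for this) is to pre-request a GPU bundle so that a node starts spinning up before the actual work arrives:

```python
import ray
from ray.autoscaler.sdk import request_resources

ray.init(address="auto")

# Ask the cluster autoscaler to provision capacity for one GPU bundle ahead
# of time, so the node is hopefully already up when a request lands.
request_resources(bundles=[{"GPU": 1}])
```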

Anyway, I’m looking forward to Serve autoscaling coming back eventually. Thanks for the work in general; Ray is great!