Autoscaling Replicas in Ray Serve

Hi all,

I’m trying to deploy a few models using Ray Serve on a Ray cluster on AWS. From what I’ve gathered, creating a backend and endpoint with Ray Serve requires explicitly setting the resources I need. So my questions are:

  • I would like to autoscale the number of model backend replicas, and thus the resources/instances, depending on how much demand there is. Ideally the scaled replicas should be reading from the same worker queue. Is that possible?
  • If I use a Ray cluster, is it possible to assign a specific model backend to a specific type of node, so that the backends are scaled separately? If so, how do I specify this? E.g. I would like model A to be scaled up only when demand for model A is high, and model B only when demand for B is high.

The current design has 3 Ray Serve backends: general processing, model A, and model B. HTTP requests go to general processing first and are then sent to either A or B for inference.
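
To make that concrete, here is a stripped-down sketch of the routing layer (preprocess and wants_model_a are placeholders, and how the model handles are obtained depends on the Serve version, so they are simply passed in here):

    import ray

    # General-processing backend that forwards work to model A or model B.
    # model_a_handle / model_b_handle are Serve handles to those backends.
    class GeneralProcessing:
        def __init__(self, model_a_handle, model_b_handle):
            self.model_a = model_a_handle
            self.model_b = model_b_handle

        def __call__(self, request):
            data = preprocess(request)            # placeholder preprocessing
            handle = self.model_a if wants_model_a(data) else self.model_b
            return ray.get(handle.remote(data))   # forward to the chosen model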

Apologies if I’m missing anything obvious; I’m quite new to Ray and deployment ops in general. Any advice would be appreciated, thanks!

Hi @maxmiel, thanks for the question! Ray Serve does not currently support automatic autoscaling of backends, but you can definitely build it as a thin layer on top. You can dynamically scale the number of replicas for a given backend by changing num_replicas via update_config(). A thin layer that monitors load/metrics from the system could call this API to update the number of replicas dynamically. Hope that helps.
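
A minimal sketch of that thin layer, assuming the Ray 1.x client API (where the call is client.update_backend_config; the exact name may vary by version) and a placeholder get_load callback for whatever load signal you monitor:

    import time

    def autoscale_loop(client, backend_tag, get_load, interval_s=30):
        # client: the object returned by serve.start() when the backend was created
        # get_load: placeholder callback returning a load signal (queue length, QPS, ...)
        while True:
            load = get_load()
            # crude placeholder policy: one replica per ~5 units of load, capped at 10
            target = min(10, max(1, load // 5))
            client.update_backend_config(backend_tag, {"num_replicas": target})
            time.sleep(interval_s)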

Regarding the node types, you could accomplish this by creating a custom resource on the nodes (with the resources field in the autoscaler config) and then targeting that with your backends by passing a resource requirement to ray_actor_options.
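
Roughly, and untested (ModelA is your backend class, and model_a_node is whatever name you give the custom resource):

    # In the cluster YAML, give the model-A node type a custom resource, e.g.
    #   resources: {"model_a_node": 1}
    # Then request a small slice of that resource so replicas only land on those
    # nodes (a fraction lets several replicas share one node):
    client.create_backend(
        "model_a",
        ModelA,
        ray_actor_options={"resources": {"model_a_node": 0.01}},
        config={"num_replicas": 2},
    )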

Thanks for the fast reply @eoakes! If I have a backend with multiple replicas on a node, and that node is scaled through the Ray cluster, will the replicas still be working from the same queue? I’m a little confused about how load balancing works across Ray Serve workers and across Ray cluster nodes.

Yeah, in experimenting with Ray 1.2 (single node, no GPUs, just CPUs on my laptop), I created a simple model server (PyTorch/ResNet) fronted by FastAPI, based on sample code/examples, and created 2 replicas. If I hit the server with multiple concurrent requests, only 2 could be handled at a time, as expected. I then decided to play around: while still keeping the 2 model replicas, I moved the inference code into a separate function outside the model class:
    import ray

    @ray.remote
    def inference_ext(model, image):
        # inference code: run the model's forward pass in a separate worker process
        ...

and then, in the ImageModel class’s __call__ method, I called the above function.
Looking at the Ray dashboard, I saw additional workers spin up for the work being done in inference_ext and then go idle. Now, is this the “right” way to do simple autoscaling on a single node? I don’t know; it was just a quick experiment. The downside I saw is that for a single inference request (or two), the inference processing time was 75%-125% longer once I moved the processing out into the ray.remote function. I suspect that is due to the overhead of handing the work to a separate process, but that is mere speculation, and those numbers should be treated as quick test observations, not a true analysis.
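
For reference, the shape of what I did was roughly this (simplified; inference_ext is the remote function above, and load_model / extract_image are stand-ins for the real loading and request-parsing code):

    class ImageModel:
        def __init__(self):
            self.model = load_model()             # stand-in: loads the ResNet weights

        def __call__(self, request):
            image = extract_image(request)        # stand-in for request parsing
            # dispatch to the remote function, which runs in its own worker process
            return ray.get(inference_ext.remote(self.model, image))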

Thanks for the suggestion. I did give it a try, but it spawned a few other issues.

I didn’t solve it, but for whoever might be interested:

  1. Using @puntime_error’s suggestion, I couldn’t find a nice way to kill the processes from inside the function (after they finished inference), so idle processes were continuously created and kept hogging RAM after processing. If I tried to run a remote task each time, it would run out of memory and crash the instance.

  2. To kill the remote task from the inside, I used an actor class instead, which let me call ray.actor.exit_actor() after inference (see the first sketch after this list). This somewhat worked, but it had the overhead of starting and killing a lot of actors over time. Also, since I was trying to autoscale nodes with this, processing 10 queries in the Serve queue meant launching numerous actor instances at the same time and waiting for them to be set up before any inference could run.

  3. I tried to manually build a Ray actor + job queue myself to manage/autoscale how many actors there are (roughly the pool pattern in the second sketch after this list). This should have worked, but I ran into some unknown errors that killed the actors immediately after creation. I think this might be the best way to do it at the moment, but I couldn’t debug the error. I spent some time on it; maybe I was looking in the wrong place, but I found nothing useful in the logs, and the actors literally disappeared from my dashboard within seconds with no errors shown. :confused:

  4. I attempted eoakes’s suggestion, but it seems the resource demands for replicas only extend to a single node and not the whole cluster/multiple nodes, so the script exits with an error saying it couldn’t scale the replicas. We’re working on small instances (2 CPU / 8 GB RAM), and ideally we can spawn many of these through autoscaling, so I was hoping one backend could extend across multiple nodes, but apparently not?
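
For point 2, the pattern looked roughly like this (load_model is a stand-in; I put the exit in its own method so the inference result still comes back before the actor dies):

    import ray

    @ray.remote
    class OneShotInference:
        def __init__(self, model_path):
            self.model = load_model(model_path)   # stand-in for the real loader

        def infer(self, image):
            return self.model(image)

        def exit(self):
            # free this actor's process (and its RAM) once the work is done
            ray.actor.exit_actor()

    # usage: create an actor, run one inference, then tear it down
    # actor = OneShotInference.remote("model.pt")
    # result = ray.get(actor.infer.remote(image))
    # actor.exit.remote()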
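
And for point 3, what I was trying to build was essentially a fixed pool of inference actors fed from a shared work queue, something like this sketch with ray.util.ActorPool (load_model and images are stand-ins):

    import ray
    from ray.util import ActorPool

    @ray.remote
    class InferenceWorker:
        def __init__(self, model_path):
            self.model = load_model(model_path)   # stand-in for the real loader

        def infer(self, image):
            return self.model(image)

    # Fixed-size pool; "autoscaling" here means adding/removing workers yourself.
    workers = [InferenceWorker.remote("model.pt") for _ in range(2)]
    pool = ActorPool(workers)

    # ActorPool hands each image to the next idle worker, like a shared work queue.
    results = list(pool.map(lambda actor, img: actor.infer.remote(img), images))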

Those are great experiments and data points… thanks for sharing. I really don’t have any other suggestions, as I am new to ray.io and figuring things out as I go. I suspect that as ray.io grows in popularity and acceptance, these issues will become more relevant to the community, and things like autoscaling backends/models will become easier to do directly within Ray. I do wonder whether, if you built an architecture on top of Kubernetes, you could use the deployments of worker nodes and their pods to scale the Ray backends up and down. I haven’t given it much thought beyond the sentence I just wrote.
