Scale Multiple Ray Serve Deployments Proportionally

Severity: Medium. It contributes significant difficulty to completing my task, but I can work around it.

I created an ML API application with Ray Serve, and it has the following components (simplified):

  • APIDeployment: handles API requests and calls the other ML deployments
  • MLDeployment1: a lightweight model (e.g., an embedding model)
  • MLDeployment2: a larger model (e.g., an LLM)

Each request sequentially calls MLDeployment1 and then MLDeployment2 to get the result, roughly as in the sketch below. Now I’m working on autoscaling this application, and I’m a bit confused and would like some suggestions/advice.
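
A minimal sketch of this layout (placeholder model code standing in for the real application):

```python
from ray import serve
from ray.serve.handle import DeploymentHandle
from starlette.requests import Request


@serve.deployment
class MLDeployment1:
    # Placeholder for the lightweight embedding model.
    async def __call__(self, text: str) -> list:
        return [0.0] * 384  # pretend embedding


@serve.deployment
class MLDeployment2:
    # Placeholder for the larger LLM.
    async def __call__(self, embedding: list) -> str:
        return "generated text"  # pretend LLM output


@serve.deployment
class APIDeployment:
    def __init__(self, ml1: DeploymentHandle, ml2: DeploymentHandle):
        self.ml1 = ml1
        self.ml2 = ml2

    async def __call__(self, request: Request) -> str:
        text = (await request.json())["text"]
        embedding = await self.ml1.remote(text)  # step 1: embedding model
        return await self.ml2.remote(embedding)  # step 2: LLM


app = APIDeployment.bind(MLDeployment1.bind(), MLDeployment2.bind())
```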

With only one deployment in the app (i.e., model inference and the API in the same deployment), it’s simple: we can configure it with a single set of autoscaling parameters.
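
By that I mean a single `autoscaling_config` like this (values are just illustrative):

```python
@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 10,
        # called target_num_ongoing_requests_per_replica on older Ray versions
        "target_ongoing_requests": 2,
    },
)
class SingleDeployment:
    ...
```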

With multiple deployments in the app, my understanding is that each deployment has its own autoscaling configuration (though the values can all be the same). This means, IMO, that each deployment scales independently according to its own traffic and queue length. In my case, since MLDeployment2 takes longer to run, its queue will be much longer, so it will eventually end up with more replicas than APIDeployment or MLDeployment1. Is this correct?
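
For example, reusing the deployments from the sketch above, I could give every deployment identical autoscaling values, yet each one would still scale off its own ongoing-request count (values again illustrative):

```python
# Same autoscaling values on every deployment, but each deployment's
# autoscaler watches only its own queue, so I'd expect the slow LLM
# deployment to grow its replica count first.
shared = {
    "min_replicas": 1,
    "max_replicas": 10,
    "target_ongoing_requests": 2,
}

app = APIDeployment.options(autoscaling_config=shared).bind(
    MLDeployment1.options(autoscaling_config=shared).bind(),
    MLDeployment2.options(autoscaling_config=shared).bind(),
)
```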

Is there a way to scale all replicas at the same rate/moment when there are multiple deployments in an app?

From your experience, is the above the best approach in terms of throughput and performance? Is there an option to scale all replicas proportionally without modifying the code? (I know I could simply put everything into one giant deployment.)

I didn’t find any documentation on autoscaling multiple deployments within one app, so I’m asking here.

Thank you!