Medium: It contributes to significant difficulty to complete my task, but I can work around it.
I created an ML API application with Ray Serve, and it has the following components (simplified):
- APIDeployment: handles API requests and calls the other ML deployments
- MLDeployment1: a lightweight model (e.g. an embedding model)
- MLDeployment2: a larger model (e.g. an LLM)
A request sequentially calls MLDeployment1 and then MLDeployment2 to get the result. Now I'm working on an autoscaling deployment for this application, and I'm a bit confused and would like some suggestions/advice.
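For concreteness, here's a rough sketch of the structure (the model logic is just a placeholder; my real app loads actual models):

```python
from ray import serve
from ray.serve.handle import DeploymentHandle
from starlette.requests import Request


@serve.deployment
class MLDeployment1:
    def __call__(self, text: str) -> list[float]:
        # lightweight model, e.g. an embedding model (placeholder output)
        return [0.0] * 128


@serve.deployment
class MLDeployment2:
    def __call__(self, embedding: list[float]) -> str:
        # larger model, e.g. an LLM (placeholder output)
        return "generated text"


@serve.deployment
class APIDeployment:
    def __init__(self, ml1: DeploymentHandle, ml2: DeploymentHandle):
        self.ml1 = ml1
        self.ml2 = ml2

    async def __call__(self, request: Request) -> str:
        text = await request.json()
        embedding = await self.ml1.remote(text)   # call MLDeployment1 first
        return await self.ml2.remote(embedding)   # then MLDeployment2


app = APIDeployment.bind(MLDeployment1.bind(), MLDeployment2.bind())
```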
With only one deployment in the app (i.e. model inference and the API in the same deployment), it's simple: we can configure it with a single set of autoscaling parameters.
With multiple deployments in the app, my understanding is that each deployment has its own autoscaling configuration (though the values can all be the same). This means, IMO, that each deployment scales independently according to its own traffic and queue length. In my case, since MLDeployment2 takes longer to run, its queue will be much longer, so it will eventually have more replicas than APIDeployment or MLDeployment1. Is this correct?
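To make the question concrete, this is roughly how I'm configuring it now, with one autoscaling_config per deployment (all numbers are placeholders, and I believe the key knob is target_num_ongoing_requests_per_replica / target_ongoing_requests, please correct me if I'm wrong):

```python
from ray import serve

# Each deployment carries its own autoscaling_config; values here are illustrative only.
@serve.deployment(autoscaling_config={
    "min_replicas": 1,
    "max_replicas": 4,
    "target_num_ongoing_requests_per_replica": 5,  # "target_ongoing_requests" in newer Ray
})
class APIDeployment:
    ...


@serve.deployment(autoscaling_config={
    "min_replicas": 1,
    "max_replicas": 4,
    "target_num_ongoing_requests_per_replica": 5,
})
class MLDeployment1:
    ...


@serve.deployment(autoscaling_config={
    "min_replicas": 1,
    "max_replicas": 16,  # expecting the slower LLM deployment to need more replicas
    "target_num_ongoing_requests_per_replica": 2,
})
class MLDeployment2:
    ...
```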
Is there a way to scale all deployments at the same rate/time when there are multiple deployments in an app?
From your experience, is the above the best approach in terms of throughput and performance? Is there an option to scale all deployments proportionally without modifying the code? (I know I could simply put everything into a single giant deployment, sketched below.)
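For reference, this is the single-deployment alternative I mean: both models live in one replica, so a single autoscaling config governs everything, but the light and heavy models can no longer scale independently (load_embedding_model and load_llm are hypothetical loaders standing in for my real model setup):

```python
from ray import serve
from starlette.requests import Request


@serve.deployment(autoscaling_config={"min_replicas": 1, "max_replicas": 8})
class MonolithicDeployment:
    def __init__(self):
        # Hypothetical loaders: both models are created inside one replica.
        self.embedder = load_embedding_model()
        self.llm = load_llm()

    async def __call__(self, request: Request) -> str:
        text = await request.json()
        embedding = self.embedder(text)
        return self.llm(embedding)


app = MonolithicDeployment.bind()
```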
I didn't find any documentation on autoscaling multiple deployments within one app, so I'm asking here.
Thank you!