Dynamic Deployment on Ray Serve

How severe does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity
  • Low: It annoys or frustrates me for a moment.

We’ve been using Ray Serve for LLM serving for some time (each deployment being a different model), and the capabilities it provides have been significantly valuable.

However, throughout this time there has been one thing that has repeatedly blocked us and cost us time: whenever a new Serve deployment needs to be deployed, we restart the whole cluster. Especially at a time like this, when new OSS models are released so frequently, it would add real value to be able to deploy new Serve deployments dynamically through an API.

From other discussions, I know that the graph is built at startup, but I wanted to get a better sense from the community of how feasible this enhancement would be (or not), and whether there is any other solution or tip we could use for our case instead of restarting the cluster.

Thank you.

cc @Gene and @kourosh and @Akshay_Malik

Could you share a bit more about what your application looks like? For example, how do you construct your application/graph?

Sorry for the late reply (I don’t get the notifications, or I somehow overlook them).

Our structure is a set of Serve deployments (a different deployment for each model), each of which contains a vLLM engine. There is also an ingress deployment with various logic inside, but its basic responsibility is holding all of the deployment handles and directing each request to the corresponding deployment based on the model name in the chat completion request (sketched below).
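A minimal sketch of that structure, assuming two illustrative models (in the real setup each model deployment wraps a vLLM engine):

```python
from ray import serve
from ray.serve.handle import DeploymentHandle


@serve.deployment
class LlamaDeployment:
    async def __call__(self, request: dict) -> dict:
        # In the real setup this wraps a vLLM engine.
        return {"model": "llama", "echo": request}


@serve.deployment
class MistralDeployment:
    async def __call__(self, request: dict) -> dict:
        return {"model": "mistral", "echo": request}


@serve.deployment
class Ingress:
    def __init__(self, llama: DeploymentHandle, mistral: DeploymentHandle):
        # The handle set is fixed when the graph is built, which is why
        # adding or removing a model currently requires a full redeploy.
        self._handles: dict[str, DeploymentHandle] = {
            "llama": llama,
            "mistral": mistral,
        }

    async def __call__(self, request: dict) -> dict:
        # Route on the model name from the chat completion request.
        handle = self._handles[request["model"]]
        return await handle.remote(request)


app = Ingress.bind(LlamaDeployment.bind(), MistralDeployment.bind())
```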

Currently, whenever a deployment needs to be removed or a new one added, we have to restart the whole cluster (which means taking the cluster out of the route first and making sure it comes back healthy after restart) so that the graph is rebuilt with the set of deployments that includes the new model deployment or excludes the removed one.

Ray Serve has simplified many aspects of deployment and offers quite unique capabilities, but plain Kubernetes would give us the flexibility of deploying new things or undeploying existing ones on the fly, without taking the whole cluster down and bringing it back up again.

I believe that if Ray Serve had this capability, it would be considerably more powerful.

I know the whole logic of deployment handle ownership is buried inside Python code, which could make dynamic deploy/undeploy harder. But the case I described seems quite common: the ingress holds a set of deployment handles in a data structure (a Python dict in our case, mapping “model_name” -> model_deployment_handle). If it were possible to update this data structure on the fly (how generic this should be is another topic) by registering a new Serve deployment definition at runtime, instantiating a deployment handle for it, and adding that handle to the dict, Ray Serve could deploy/undeploy things in a Pythonic way. That was just an idea, and I’d be happy to hear whether anything along these lines has been considered over time.
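To make the idea concrete, here is a purely hypothetical sketch: the register_model method is our own invention, and the commented-out deploy call does not exist in Ray Serve today.

```python
from ray import serve
from ray.serve.handle import DeploymentHandle


@serve.deployment
class Ingress:
    def __init__(self):
        # "model_name" -> deployment handle, mutable at runtime.
        self._handles: dict[str, DeploymentHandle] = {}

    def register_model(self, model_name: str, handle: DeploymentHandle) -> None:
        # Called out-of-band when a new model deployment comes up.
        self._handles[model_name] = handle

    async def __call__(self, request: dict) -> dict:
        return await self._handles[request["model"]].remote(request)


# Hypothetical API (does NOT exist in Ray Serve): deploy a new model
# deployment into the running cluster and push its handle into the
# live ingress without rebuilding the whole graph.
#
# new_handle = serve.deploy_deployment(NewModelDeployment, name="new-model")
# ingress_handle.register_model.remote("new-model", new_handle)
```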

Model multiplexing exists as a solution, and it is actually possible to use. But it’s not feasible for us; let me explain why based on our workflow.
Some LLMs use 1 GPU, some use 4, and some use 8 (call those small, medium, and large deployments).
We could dynamically deploy and serve models using max_num_models_per_replica=1, selecting one of the three deployments based on the required size. But that doesn’t actually create a new deployment; it just creates a new replica for each new multiplexing key. That means no per-model autoscaling config (min_replicas etc.), no deployment-specific configs, and no deployment-specific features (UI visibility etc.). Hence model multiplexing is not the best solution here, while what is needed is a new deployment. See the sketch below.
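For reference, this is roughly what the multiplexed variant looks like for one of the three size classes, using Ray Serve’s multiplexing API (the engine class is a stand-in for our vLLM setup):

```python
from ray import serve


class FakeEngine:
    # Stand-in for a vLLM engine; generate() just echoes the request.
    async def generate(self, request: dict) -> dict:
        return {"echo": request}


@serve.deployment(ray_actor_options={"num_gpus": 1})
class SmallModelDeployment:
    @serve.multiplexed(max_num_models_per_replica=1)
    async def get_model(self, model_id: str) -> FakeEngine:
        # In the real setup this would construct a vLLM engine for model_id.
        return FakeEngine()

    async def __call__(self, request: dict) -> dict:
        # Serve reads the multiplexing key from the
        # "serve_multiplexed_model_id" request header.
        model_id = serve.get_multiplexed_model_id()
        engine = await self.get_model(model_id)
        return await engine.generate(request)


app = SmallModelDeployment.bind()
```

Every model routed here shares this single deployment’s configuration, which is exactly the limitation described above: there is no per-model min_replicas, autoscaling, or other deployment-level setting.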

I’m happy to discuss this further.

Thanks.

@rliaw

@Erkin if you have separate applications for each model, then this can work: Updating Applications In-Place — Ray 2.43.0. Essentially, you’d apply a new Serve config with apps added or removed, and only the changed apps will be redeployed. The ingress deployment can use the get_deployment_handle API.
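A rough sketch of the ingress side under that setup, assuming each per-model app exposes a deployment named ModelDeployment and the app name matches the model name (both names are illustrative):

```python
from ray import serve
from ray.serve.handle import DeploymentHandle


@serve.deployment
class Ingress:
    def __init__(self):
        self._handles: dict[str, DeploymentHandle] = {}

    async def __call__(self, request: dict) -> dict:
        model = request["model"]
        if model not in self._handles:
            # Resolve the handle at request time instead of binding it at
            # graph-build time, so newly added apps are picked up without
            # redeploying the ingress.
            self._handles[model] = serve.get_deployment_handle(
                "ModelDeployment", app_name=model
            )
        return await self._handles[model].remote(request)
```

Adding or removing a model then only means adding or removing its application in the Serve config and re-applying it; the unchanged apps (including the ingress) keep running.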