Multiple Independent Models behind a single API endpoint?

How severely does this issue affect your experience of using Ray?
Medium: It contributes to significant difficulty in completing my task, but I can work around it.

Hello Ray Community,

I have a question about hosting multiple completely independent models in a single cluster.
I am running multiple LLMs with vllm, which all share the same code for the @serve.deployment actor. Currently, I am hosting multiple applications, each using the same deployment code but with different model weights loaded and behind different endpoints (for example, one application for llama3 70B with the endpoint /llama70/chat and another for llama3 8B behind /llama8/chat/).

Now, I would like to organise this differently: I would prefer all models to sit behind one endpoint (/llm/chat/) and to differentiate between the models by passing an extra argument in the HTTP request to this endpoint. What would be the best way to achieve this?

As far as I understand, one application cannot host multiple different deployments and discriminate between them using a parameter in the HTTP request. Is this correct? If not, what am I missing?

Three options came to mind:

  1. I could try multiplexing the models. This would allow me to call the models in the way I described above. However, it would require significant changes to my code, and as I understand it, multiplexing is meant for cases where not all models can be loaded into memory simultaneously (in my case they can). A rough sketch of what this might look like follows this list.
  2. Write an ingress application that redirects the incoming request to another application based on a parameter in the HTTP body or header.
  3. Write an ingress deployment that redirects to other deployments, all within one application. This has the drawback that whenever I want to update one model, it would restart all the others as well.
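
For reference, here is a rough sketch of what I imagine option 1 would look like with Ray Serve's model multiplexing API (untested; load_vllm_model and the prompt handling are placeholders for my actual code):

from ray import serve
from starlette.requests import Request


@serve.deployment
class MultiplexedLLM:
    @serve.multiplexed(max_num_models_per_replica=2)
    async def get_model(self, model_id: str):
        # Placeholder: build the vllm engine for the requested weights.
        return load_vllm_model(model_id)

    async def __call__(self, request: Request):
        # The model id comes from the "serve_multiplexed_model_id" request header,
        # not from the request body.
        model_id = serve.get_multiplexed_model_id()
        model = await self.get_model(model_id)
        body = await request.json()
        return model.generate(body["prompt"])  # placeholder inference call


app = MultiplexedLLM.bind()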

What would be the best option for me to host the models?

Thank you so much for your help, I really appreciate it!

Hi @luca, I think all of those are viable options. What I have in mind is borrowing from OpenAI’s API interface (https://platform.openai.com/docs/api-reference/chat/create): you can create a “router deployment” that exposes the single endpoint and takes the model id as one of the request parameters. You would then use model composition to route each request to the appropriate “model deployment”. Each model deployment can host its own model and scale independently. Hope this helps.
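
For example, a client call to the single endpoint could look roughly like this (the /llm/chat route and the "model"/"prompt" field names are just placeholders):

import requests

# The router reads the "model" field and forwards the prompt
# to the matching model deployment.
resp = requests.post(
    "http://localhost:8000/llm/chat",
    json={"model": "llama3-8b", "prompt": "Hello!"},
)
print(resp.text)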


Hi @Gene,
Thanks a lot for your response!

While trying to implement this, I realized there is another problem with the model composition option: all my models currently run the same code.
What I mean is that I essentially have one deployment class, GenericDeploymentClass, where the only difference between the models is the weights I choose to load in the constructor. Currently, I am running multiple applications, each with the same GenericDeploymentClass as a deployment (but with different URL prefixes), and I pass the path to the weights as arguments in the Serve config file.

If I want to do this with model composition, how can I still scale the different model deployments with different replica counts? They all share the same code, so I cannot distinguish between them in my Serve config file. Is this even possible?

You can use .options() on your deployments to configure them differently. This is just pseudo-code, but hopefully it makes sense:

from ray import serve
from starlette.requests import Request


@serve.deployment
class MyLlamaDeployment:
    def __init__(self, model_loader):
        # model_loader is a placeholder for however you build your vllm engine.
        self.model = model_loader.load()

    def predict(self, input):
        return self.model.predict(input)


@serve.deployment
class MyRouter:
    def __init__(self, llama8_handle, llama70_handle):
        self.handles = {"llama8": llama8_handle, "llama70": llama70_handle}

    async def __call__(self, request: Request):
        # Route based on a "model" field in the request body.
        body = await request.json()
        handle = self.handles[body["model"]]
        return await handle.predict.remote(body["input"])


# llama8_model_loader / llama70_model_loader are placeholders for your loader objects.
llama8_app = MyLlamaDeployment.options(
    autoscaling_config={"min_replicas": 0, "max_replicas": 10}
).bind(llama8_model_loader)
llama70_app = MyLlamaDeployment.options(
    autoscaling_config={"min_replicas": 0, "max_replicas": 1}
).bind(llama70_model_loader)
router_app = MyRouter.bind(llama8_app, llama70_app)
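
To run this, something like the following should work (assuming you want everything behind a single /llm prefix):

from ray import serve

# Deploys the router plus both model deployments as one application.
serve.run(router_app, name="llm", route_prefix="/llm")

You should also be able to give each MyLlamaDeployment a distinct name via .options(name=...), so the two show up as separate entries in the Serve config file and can be tuned there independently.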
