How severe does this issue affect your experience of using Ray?
Medium: It contributes to significant difficulty to complete my task, but I can work around it.
Hello Ray Community,
I have a question about hosting multiple completely independent models in a single cluster.
I am running multiple LLMs with vLLM, which all share the same code for the @serve.deployment actor. Currently, I am hosting multiple applications, each using the same deployment code but with different model weights loaded and behind different endpoints (for example, one application for llama3 70B behind /llama70/chat and another for llama3 8B behind /llama8/chat/).
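For context, my current setup is roughly the following (a simplified sketch; `LLMServer` and the model-path strings are placeholders for my actual vLLM wrapper and weights):

```python
from ray import serve


@serve.deployment
class LLMServer:
    """Shared deployment code; only the model weights differ per application."""

    def __init__(self, model_path: str):
        # In my real code this starts a vLLM engine for the given weights.
        self.model_path = model_path

    async def __call__(self, request):
        prompt = (await request.json())["prompt"]
        # ... run vLLM inference here ...
        return {"model": self.model_path, "prompt": prompt}


# Each model is its own Serve application behind its own route prefix
# ("llama3-70b" / "llama3-8b" stand in for my real model paths).
serve.run(LLMServer.bind(model_path="llama3-70b"),
          name="llama70", route_prefix="/llama70")
serve.run(LLMServer.bind(model_path="llama3-8b"),
          name="llama8", route_prefix="/llama8")
```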
Now, I would like to organise this differently: I would prefer all models to sit behind one endpoint (/llm/chat/) and differentiate between the models by passing an extra argument in the HTTP request to this endpoint. What would be the best way to achieve this?
As far as I understand, one application cannot host multiple different deployments and discriminate between them using a parameter in the HTTP request. Is this correct? If not, what am I missing?
Three options came to mind:
- I could try multiplexing the models (see the first sketch after this list). This would allow me to call the models in the way I described above. However, it would require significant changes to my code, and as I understand it, multiplexing is meant for cases where not all models fit into memory simultaneously (in my case they do).
- Write an ingress application that redirects each incoming request to another application based on a parameter in the HTTP body or a header.
- Write an ingress deployment that redirects to other deployments, all within one application (see the second sketch after this list). The drawback is that whenever I want to update one model, all the other models restart as well.
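For reference, this is what I think the multiplexing option would look like (a minimal sketch, assuming I read the multiplexing API correctly; `MultiplexedLLM` and the engine-loading code are placeholders for my actual setup):

```python
from ray import serve


@serve.deployment
class MultiplexedLLM:
    @serve.multiplexed(max_num_models_per_replica=2)
    async def get_model(self, model_id: str):
        # Load (or return the cached) engine for this model_id; in my real
        # code this would construct a vLLM engine for the given weights.
        return f"engine-for-{model_id}"

    async def __call__(self, request):
        # The model is chosen via the "serve_multiplexed_model_id" request
        # header rather than a field in the JSON body.
        model_id = serve.get_multiplexed_model_id()
        model = await self.get_model(model_id)
        prompt = (await request.json())["prompt"]
        # ... run inference with `model` ...
        return {"model": model_id, "prompt": prompt}


app = MultiplexedLLM.bind()
```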
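And this is roughly what I have in mind for the ingress/router options, shown here as the single-application variant from the last bullet (`Router`, `LLMServer` and the model-path strings are placeholder names, not my real code):

```python
from ray import serve
from ray.serve.handle import DeploymentHandle


@serve.deployment
class LLMServer:
    def __init__(self, model_path: str):
        self.model_path = model_path  # real code: start a vLLM engine here

    async def generate(self, prompt: str) -> dict:
        # ... run vLLM inference here ...
        return {"model": self.model_path, "prompt": prompt}


@serve.deployment
class Router:
    def __init__(self, llama70: DeploymentHandle, llama8: DeploymentHandle):
        self.models = {"llama70": llama70, "llama8": llama8}

    async def __call__(self, request):
        body = await request.json()
        handle = self.models[body["model"]]  # model chosen by a body field
        return await handle.generate.remote(body["prompt"])


# One application behind a single route prefix (e.g. /llm); the router
# fans out to the per-model deployments.
app = Router.bind(
    llama70=LLMServer.bind(model_path="llama3-70b"),
    llama8=LLMServer.bind(model_path="llama3-8b"),
)
```

For the cross-application variant from the second bullet, I imagine each LLMServer would instead be deployed as its own application and the router would look up handles to them (e.g. via serve.get_app_handle), but I have not tried that yet.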
What would be the best option for me to host the models?
Thank you so much for your help, I really appreciate it!