How to deploy LLMs that can handle high concurrency with the Ray Serve framework

YUE · September 19, 2023, 10:44am

Hello everyone! I am a student learning about Ray and LLMs.

I have run into a problem: my model can only handle requests serially.

For example, if three questions are sent at the same time, the model receives all of them at once, and whichever one is answered first is returned first.

I have made every step asynchronous, including the model, the tokenizer, and the return of the inference results.

Also, I don't clear the GPU after each answer.

godsakurapeng · January 8, 2024, 8:24am

I think you should try batch inference or another inference engine (such as vLLM).

example
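
To make the batching suggestion concrete, here is a minimal sketch of dynamic request batching in plain asyncio. It is not Ray Serve code: `fake_generate_batch`, `MicroBatcher`, and `demo` are hypothetical names made up for illustration, and the fake model just echoes the prompts. The idea is that concurrent requests are collected for a short window and passed to the model as one batch instead of one at a time.

```python
import asyncio
from typing import List, Tuple


def fake_generate_batch(prompts: List[str]) -> List[str]:
    """Hypothetical stand-in for one batched forward pass of an LLM."""
    return [f"answer to: {p}" for p in prompts]


class MicroBatcher:
    """Collects concurrent requests and runs them through the model together."""

    def __init__(self, max_batch_size: int = 8, wait_timeout_s: float = 0.05):
        self.max_batch_size = max_batch_size
        self.wait_timeout_s = wait_timeout_s
        self.queue: asyncio.Queue = asyncio.Queue()
        self.batch_sizes: List[int] = []  # recorded only so the demo can inspect it

    async def submit(self, prompt: str) -> str:
        # Each caller gets its own future and waits for the batch to resolve it.
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((prompt, fut))
        return await fut

    async def run(self) -> None:
        loop = asyncio.get_running_loop()
        while True:
            # Block until at least one request arrives, then open a short window.
            batch = [await self.queue.get()]
            deadline = loop.time() + self.wait_timeout_s
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            prompts = [p for p, _ in batch]
            # Run the blocking model call in a thread so the event loop stays free.
            outputs = await asyncio.to_thread(fake_generate_batch, prompts)
            self.batch_sizes.append(len(batch))
            for (_, fut), out in zip(batch, outputs):
                if not fut.done():
                    fut.set_result(out)


async def demo() -> Tuple[List[str], List[int]]:
    batcher = MicroBatcher()
    worker = asyncio.create_task(batcher.run())
    # Three questions arrive at the same time.
    answers = await asyncio.gather(*(batcher.submit(q) for q in ["q1", "q2", "q3"]))
    worker.cancel()
    return list(answers), batcher.batch_sizes


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In Ray Serve itself you would not need to write this by hand: the `@serve.batch` decorator (with its `max_batch_size` and `batch_wait_timeout_s` parameters) applies the same idea to a deployment method. An engine such as vLLM goes further and batches at the token level inside the engine.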
