How to deploy an LLM that can handle high concurrency with the Ray Serve framework

Hello everyone! I am a student learning Ray and LLMs.

I have run into a problem: my model can only handle requests serially.

For example, if three questions are sent in at the same time, the model receives them all at once, but they are answered one after another, and whichever finishes first is returned first.

I have set up every step as asynchronous inference, including the model, the tokenizer, and returning the inference result.
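One common cause of this symptom, if it applies to your setup: even when the handler is `async def`, a synchronous model call blocks the event loop, so the replica still processes requests one at a time. A minimal sketch of the difference, using plain asyncio rather than Ray Serve itself, with a hypothetical `slow_infer` standing in for the blocking GPU call:

```python
import asyncio
import time

def slow_infer(prompt: str) -> str:
    # Stand-in for a blocking GPU inference call (hypothetical).
    time.sleep(0.2)
    return f"answer to {prompt}"

async def handle_blocking(prompt: str) -> str:
    # Even inside an async handler, this call blocks the whole event loop,
    # so concurrent requests end up running serially.
    return slow_infer(prompt)

async def handle_offloaded(prompt: str) -> str:
    # Offloading the blocking call to a worker thread lets other
    # requests make progress while inference runs.
    return await asyncio.to_thread(slow_infer, prompt)

async def timed(handler, n: int = 3) -> float:
    # Fire n requests at the same time and measure total wall time.
    start = time.perf_counter()
    await asyncio.gather(*(handler(f"q{i}") for i in range(n)))
    return time.perf_counter() - start

if __name__ == "__main__":
    serial = asyncio.run(timed(handle_blocking))      # roughly 3 * 0.2 s
    concurrent = asyncio.run(timed(handle_offloaded)) # roughly 0.2 s
    print(f"blocking: {serial:.2f}s, offloaded: {concurrent:.2f}s")
```

In a Ray Serve deployment the same principle applies inside the replica's handler; Ray Serve can also scale out with multiple replicas, but within one replica a blocking call still serializes requests.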

And I didn't clear the GPU after each answer.

I think you should try batch inference or a different inference architecture (like vLLM).
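The batching idea can be sketched with plain asyncio: collect requests that arrive within a short window and run them as one forward pass. This is an illustrative micro-batcher, not vLLM's or Ray Serve's actual implementation (Ray Serve provides a `@serve.batch` decorator that works along these lines); `batched_infer` is a hypothetical stand-in for a batched model call:

```python
import asyncio

async def batched_infer(prompts: list[str]) -> list[str]:
    # Stand-in for one forward pass over a whole batch (hypothetical).
    await asyncio.sleep(0.1)
    return [f"answer to {p}" for p in prompts]

class MicroBatcher:
    """Collects concurrent requests for a short window, then runs one batch."""

    def __init__(self, wait_s: float = 0.05, max_size: int = 8):
        self.wait_s = wait_s
        self.max_size = max_size
        self.pending = []      # list of (prompt, future) pairs
        self.flush_task = None

    async def submit(self, prompt: str) -> str:
        fut = asyncio.get_running_loop().create_future()
        self.pending.append((prompt, fut))
        if len(self.pending) >= self.max_size:
            # Batch is full: run it immediately.
            await self._flush()
        elif self.flush_task is None:
            # First request of a new batch: start the wait-window timer.
            self.flush_task = asyncio.create_task(self._delayed_flush())
        return await fut

    async def _delayed_flush(self):
        await asyncio.sleep(self.wait_s)
        await self._flush()

    async def _flush(self):
        batch, self.pending = self.pending, []
        self.flush_task = None
        if not batch:
            return
        answers = await batched_infer([p for p, _ in batch])
        for (_, fut), ans in zip(batch, answers):
            fut.set_result(ans)

async def main():
    batcher = MicroBatcher()
    # Three simultaneous questions share a single batched forward pass.
    return await asyncio.gather(*(batcher.submit(f"q{i}") for i in range(3)))

if __name__ == "__main__":
    print(asyncio.run(main()))
```

The trade-off is the usual one: a small `wait_s` keeps latency low for a lone request, while a larger window lets more requests share one GPU pass under load.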