Ray Serve blocking requests when serving an LLM

Hey folks, I’m working on serving an LLM using Ray Serve with FastAPI. I’ve implemented request batching on my endpoint, but I’ve noticed that batches run serially instead of concurrently.
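Roughly, the setup looks like this minimal sketch (simplified: the stub model, handler names, and batch parameters are illustrative, not the real code):

```python
# Simplified sketch of the batching setup (stub model; names/params illustrative).
import time

from fastapi import FastAPI
from ray import serve

app = FastAPI()


class BlockingModel:
    """Stand-in for the real LLM; simulates a slow, blocking forward pass."""

    def generate(self, prompts: list[str]) -> list[str]:
        time.sleep(1.0)  # simulate the forward pass
        return [f"echo: {p}" for p in prompts]


@serve.deployment
@serve.ingress(app)
class LLMServer:
    def __init__(self):
        self.model = BlockingModel()

    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.1)
    async def handle_batch(self, prompts: list[str]) -> list[str]:
        # Blocking call on the event loop: no other batch can start until it returns.
        return self.model.generate(prompts)

    @app.post("/generate")
    async def generate(self, prompt: str) -> str:
        return await self.handle_batch(prompt)


serve_app = LLMServer.bind()
```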

My guess is that this is a limitation of the asyncio event loop: calling inference on the model is a blocking operation, and the GIL prevents multiple threads from running in true parallel. Hence the single event loop waits for one batch to complete before starting the next.

I’ve tried using threading with FastAPI by removing any async/await from my code, but it seems that calling remote() on a deployment returns future-like DeploymentResponse objects that need to be awaited.

For my use case, I want to serve as many requests as possible while keeping batching efficient. That is, simply increasing the batch size will reduce throughput because of the added latency of the forward pass. Is there a way to do this with the current Ray Serve setup?

I’m not too concerned about OOM issues at the moment; I simply want to call the GPU as many times as possible in parallel.

Hi @Steve_Li, do you happen to be using vLLM? If so, you can use its async engine interface, which plays nicely with asyncio: AsyncLLMEngine — vLLM.
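A rough sketch of what that could look like inside a Ray Serve deployment (placeholder model name and sampling params; the exact import paths and generate() signature depend on your vLLM version, so double-check against the docs linked above):

```python
# Sketch: vLLM's AsyncLLMEngine inside a Ray Serve deployment. The engine's
# async generate() keeps the event loop free, and vLLM does continuous
# batching across concurrent requests internally.
import uuid

from fastapi import FastAPI
from ray import serve
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

app = FastAPI()


@serve.deployment
@serve.ingress(app)
class VLLMServer:
    def __init__(self):
        # Placeholder model; swap in your own.
        engine_args = AsyncEngineArgs(model="facebook/opt-125m")
        self.engine = AsyncLLMEngine.from_engine_args(engine_args)

    @app.post("/generate")
    async def generate(self, prompt: str) -> str:
        request_id = str(uuid.uuid4())
        params = SamplingParams(max_tokens=128)
        final_output = None
        # generate() is an async generator yielding incremental RequestOutputs.
        async for output in self.engine.generate(prompt, params, request_id):
            final_output = output
        return final_output.outputs[0].text


serve_app = VLLMServer.bind()
```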

Otherwise, I would suggest wrapping the synchronous model call with run_in_executor to run it on a thread pool and avoid blocking the asyncio loop (make sure your model is thread safe or use an executor pool of size 1).
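Something like this sketch of that pattern (stub model and illustrative names; an executor of size 1 shown in case the model isn’t thread safe):

```python
# Sketch: offload the blocking forward pass to a thread pool so the asyncio
# event loop stays free to accept requests and assemble the next batch.
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

from ray import serve
from starlette.requests import Request


class BlockingModel:
    """Stand-in for the real model; simulates a slow, blocking forward pass."""

    def generate(self, prompts: list[str]) -> list[str]:
        time.sleep(1.0)
        return [f"echo: {p}" for p in prompts]


@serve.deployment
class LLMServer:
    def __init__(self):
        self.model = BlockingModel()
        # Pool of size 1: only one forward pass runs at a time, but the event
        # loop is no longer blocked while it runs.
        self.executor = ThreadPoolExecutor(max_workers=1)

    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.1)
    async def handle_batch(self, prompts: list[str]) -> list[str]:
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(self.executor, self.model.generate, prompts)

    async def __call__(self, request: Request) -> str:
        prompt = (await request.json())["prompt"]
        return await self.handle_batch(prompt)


serve_app = LLMServer.bind()
```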


Hmmmm, I did try run_in_executor, but I still ran into the same concurrency issues. It seems the actual LLM model doesn’t play nicely with the GIL.

I’ll try using vLLM; I was also looking into their async engine.

Not specifically about batching, but there’s some example code using Ray Serve to serve an LLM with vLLM at Serve a Large Language Model with vLLM — Ray 2.37.0