Ray Serve blocking requests when serving an LLM

Hey folks, I’m working on serving an LLM using Ray Serve with FastAPI. I’ve implemented request batching on my endpoint, but I’ve noticed that batches run serially instead of concurrently.
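Roughly, the setup looks like this minimal sketch (simplified: the stub model, handler names, and batch parameters are illustrative, not the real code):

```python
# Simplified sketch of the batching setup (stub model; names/params illustrative).
import time

from fastapi import FastAPI
from ray import serve

app = FastAPI()


class BlockingModel:
    """Stand-in for the real LLM; simulates a slow, blocking forward pass."""

    def generate(self, prompts: list[str]) -> list[str]:
        time.sleep(1.0)  # simulate the forward pass
        return [f"echo: {p}" for p in prompts]


@serve.deployment
@serve.ingress(app)
class LLMServer:
    def __init__(self):
        self.model = BlockingModel()

    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.1)
    async def handle_batch(self, prompts: list[str]) -> list[str]:
        # Blocking call on the event loop: no other batch can start until it returns.
        return self.model.generate(prompts)

    @app.post("/generate")
    async def generate(self, prompt: str) -> str:
        return await self.handle_batch(prompt)


serve_app = LLMServer.bind()
```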

My guess is that this is a limitation of the asyncio event loop: calling inference on the model is a blocking operation, and the GIL prevents multiple threads from running in true parallel. Hence the single event loop waits for one batch to complete before starting the next.

I’ve tried using threading with FastAPI by removing any async/await from my code, but it seems that calling remote() on a deployment returns future-like DeploymentResponse objects that need to be awaited.

For my use case, I want to serve as many requests as possible while keeping batching efficient. That is, simply increasing the batch size will reduce throughput because of the added latency of the forward pass. Is there a way to do this with the current Ray Serve setup?

I’m not too concerned about OOM issues at the moment; I simply want to call the GPU as many times as possible in parallel.

Hi @Steve_Li, do you happen to be using vLLM? If so, you can use its async engine interface, which plays nicely with asyncio: AsyncLLMEngine — vLLM.
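A rough sketch of what that could look like inside a Ray Serve deployment (placeholder model name and sampling params; the exact import paths and generate() signature depend on your vLLM version, so double-check against the docs linked above):

```python
# Sketch: vLLM's AsyncLLMEngine inside a Ray Serve deployment. The engine's
# async generate() keeps the event loop free, and vLLM does continuous
# batching across concurrent requests internally.
import uuid

from fastapi import FastAPI
from ray import serve
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

app = FastAPI()


@serve.deployment
@serve.ingress(app)
class VLLMServer:
    def __init__(self):
        # Placeholder model; swap in your own.
        engine_args = AsyncEngineArgs(model="facebook/opt-125m")
        self.engine = AsyncLLMEngine.from_engine_args(engine_args)

    @app.post("/generate")
    async def generate(self, prompt: str) -> str:
        request_id = str(uuid.uuid4())
        params = SamplingParams(max_tokens=128)
        final_output = None
        # generate() is an async generator yielding incremental RequestOutputs.
        async for output in self.engine.generate(prompt, params, request_id):
            final_output = output
        return final_output.outputs[0].text


serve_app = VLLMServer.bind()
```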

Otherwise, I would suggest wrapping the synchronous model call with run_in_executor to run it on a thread pool and avoid blocking the asyncio loop (make sure your model is thread safe or use an executor pool of size 1).
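Something like this sketch of that pattern (stub model and illustrative names; an executor of size 1 shown in case the model isn’t thread safe):

```python
# Sketch: offload the blocking forward pass to a thread pool so the asyncio
# event loop stays free to accept requests and assemble the next batch.
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

from ray import serve
from starlette.requests import Request


class BlockingModel:
    """Stand-in for the real model; simulates a slow, blocking forward pass."""

    def generate(self, prompts: list[str]) -> list[str]:
        time.sleep(1.0)
        return [f"echo: {p}" for p in prompts]


@serve.deployment
class LLMServer:
    def __init__(self):
        self.model = BlockingModel()
        # Pool of size 1: only one forward pass runs at a time, but the event
        # loop is no longer blocked while it runs.
        self.executor = ThreadPoolExecutor(max_workers=1)

    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.1)
    async def handle_batch(self, prompts: list[str]) -> list[str]:
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(self.executor, self.model.generate, prompts)

    async def __call__(self, request: Request) -> str:
        prompt = (await request.json())["prompt"]
        return await self.handle_batch(prompt)


serve_app = LLMServer.bind()
```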


Hmmmm, I did try run_in_executor, but I still ran into the same concurrency issues. It seems the actual LLM model doesn’t play nicely with the GIL.

I’ll try using vLLM; I was also looking into their async engine.

Not specifically about batching, but there’s some example code using Ray Serve to serve an LLM with vLLM at Serve a Large Language Model with vLLM — Ray 2.37.0