Hey folks, I'm working on serving an LLM using Ray Serve with FastAPI. I've implemented batching of requests on my endpoint, but I've noticed that each batch runs serially instead of concurrently.
My guess is that this is a limitation of the asyncio event loop: calling inference on the model is a blocking operation, and the GIL prevents multiple threads from running in true parallel fashion. Hence, the single event loop waits for one batch to complete before starting the next.
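For context, here's a stripped-down sketch of roughly what my deployment looks like (the model loading and names are just placeholders, not my exact code):

```python
from typing import List

from fastapi import FastAPI
from ray import serve

app = FastAPI()


@serve.deployment(ray_actor_options={"num_gpus": 1})
@serve.ingress(app)
class LLMServer:
    def __init__(self):
        self.model = load_model()  # placeholder for the actual model load

    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.05)
    async def batched_generate(self, prompts: List[str]) -> List[str]:
        # Blocking forward pass: while this runs, the event loop can't
        # start working on the next batch.
        return self.model.generate(prompts)

    @app.post("/generate")
    async def generate(self, prompt: str) -> str:
        return await self.batched_generate(prompt)


serve.run(LLMServer.bind())
```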
I've tried utilizing threading with FastAPI by removing any `async` or `await` from my code, but it seems that calling `.remote()` on a deployment returns a future object, a `DeploymentResponse`, that still needs to be awaited.
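To illustrate, this is roughly the shape of the call that forces me back into async (again simplified, with a hypothetical downstream model deployment):

```python
from fastapi import FastAPI
from ray import serve
from ray.serve.handle import DeploymentHandle, DeploymentResponse

app = FastAPI()


@serve.deployment
@serve.ingress(app)
class Router:
    def __init__(self, llm: DeploymentHandle):
        self.llm = llm

    @app.post("/generate")
    async def generate(self, prompt: str) -> str:
        # .remote() returns immediately, but it hands back a
        # DeploymentResponse (a future), not the actual string,
        # so this handler still has to be async and await it.
        response: DeploymentResponse = self.llm.generate.remote(prompt)
        return await response


# Hypothetical composition with the model deployment above:
# serve.run(Router.bind(LLMServer.bind()))
```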
For my use case, I want to be able to serve as many requests as possible while maintaining efficient batching. That is, simply increasing the batch size will eventually reduce throughput due to the added latency of the forward pass on a larger batch. Is there a way to do this with the current Ray Serve setup?
I'm not too concerned with OOM issues at the moment - I simply want to call the GPU as many times as possible in parallel.