ds.iter_batches() is a convenience method that returns an iterable over batches directly from the Dataset, while ds.iterator().iter_batches() first creates a DataIterator object and then calls its iter_batches() method. Both yield batches in the same way, but using ds.iterator() is recommended for advanced use cases like distributed training, as it provides more control and supports features like streaming splits (Ray DataIterator API, Ray Dataset.iter_batches docs).
Another related topic worth noting is why asynchronous execution is preferred.
The underlying implementation of build_llm_processor* utilizes asynchronous map_batches and the asynchronous vLLM engine. Benchmarks show that asynchronous execution outperforms synchronous execution in most scenarios, especially when the decode sequence length fluctuates. This is because synchronous map_batches processes batches sequentially, blocking later batches. More importantly, asynchronous execution leverages the continuous batching offered by the vLLM engine, batching at the token level rather than just the request level to further improve resource utilization.
That said, asynchronous execution is enabled by default with build_llm_processor, and it requires no code changes in the user’s application.
*: Prefer build_processor going forward, as build_llm_processor is on a deprecation path. The arguments for both APIs are exactly the same.
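For reference, a minimal processor built this way looks roughly like the sketch below. The model name, concurrency, and batch size are illustrative, and actually running it requires a GPU plus `ray[llm]` with vLLM installed; per the footnote above, build_processor accepts the same arguments.

```python
import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

config = vLLMEngineProcessorConfig(
    model_source="unsloth/Llama-3.1-8B-Instruct",  # illustrative model
    engine_kwargs={"max_model_len": 8192},
    concurrency=1,   # number of vLLM engine replicas (Ray Data actors)
    batch_size=64,   # rows dispatched to each actor at a time
)

processor = build_llm_processor(
    config,
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["prompt"]}],
        sampling_params=dict(temperature=0.0, max_tokens=128),
    ),
    postprocess=lambda row: dict(answer=row["generated_text"]),
)

ds = ray.data.from_items([{"prompt": "What is Ray Data?"}])
ds = processor(ds)
print(ds.take_all())
```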
@RunLLM when running batch offline inference with Ray Data, seems that it produces very similar throughput numbers across many different batch sizes, whereas one would expect that increasing the batch size yields higher throughput. It seems that the batch size defined in VLLMEngineProcessorConfig is not really the effective batch size. Can you explain why? Is this related to how the dataset gets actually partitioned into blocks?
Hey @eppane, the batch size setting in vLLMEngineProcessorConfig is the size of the batch dispatched to each Ray Data actor. Each Ray Data actor maintains a task queue and runs inference with the underlying vLLM engine, which uses continuous batching (at the token level) to keep the engine saturated. Once the batch size is sufficient to not starve the Ray Data actors, increasing it won’t yield higher throughput. Here’s a relevant blog that may be helpful: Ray Data LLM enables 2x throughput over vLLM’s synchronous LLM engine at production-scale. You can also monitor your GPU utilization to confirm your job is effectively utilizing resources.
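The saturation effect can be illustrated with a toy model (not vLLM internals): an engine that decodes one token per step for a fixed number of in-flight requests, fed in dispatch batches. Once the batch size reaches the engine's capacity, larger batches no longer reduce the total step count. All numbers are made up for illustration.

```python
def steps_to_finish(batch_size, num_requests=64, tokens_per_req=8, capacity=32):
    """Toy continuous-batching model: each step the engine decodes one
    token for up to `capacity` in-flight requests. Requests arrive in
    groups of `batch_size`, refilled whenever the in-flight set drops
    below that size."""
    pending = num_requests
    inflight = []
    steps = 0
    while pending or inflight:
        # Dispatch the next batch when the in-flight set has room.
        while pending and len(inflight) < batch_size:
            inflight.append(tokens_per_req)
            pending -= 1
        # Decode one token for up to `capacity` in-flight requests.
        for i in range(min(capacity, len(inflight))):
            inflight[i] -= 1
        inflight = [t for t in inflight if t > 0]
        steps += 1
    return steps

# A batch size below the engine's capacity starves it...
print(steps_to_finish(batch_size=8))    # 64 steps
# ...but past the saturation point, bigger batches change nothing:
print(steps_to_finish(batch_size=32))   # 16 steps
print(steps_to_finish(batch_size=64))   # 16 steps
print(steps_to_finish(batch_size=128))  # 16 steps
```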