ds.iter_batches() is a convenience method that returns an iterable over batches directly from the Dataset, while ds.iterator().iter_batches() first creates a DataIterator object and then calls its iter_batches() method. Both yield batches in the same way, but using ds.iterator() is recommended for advanced use cases like distributed training, as it provides more control and supports features like streaming splits (Ray DataIterator API, Ray Dataset.iter_batches docs).
Another related topic worth noting is why asynchronous execution is preferred.
The underlying implementation of build_llm_processor* utilizes asynchronous map_batches and the asynchronous vLLM engine. Benchmarks show that asynchronous execution outperforms synchronous execution in most scenarios, especially when the decode sequence length fluctuates. This is because synchronous map_batches processes batches sequentially, blocking later batches. More importantly, asynchronous execution leverages the continuous batching offered by the vLLM engine, batching at the token level rather than just the request level to further improve resource utilization.
That said, asynchronous execution is enabled by default with build_llm_processor, and it requires no code changes in the user’s application.
*: Prefer build_processor going forward, as build_llm_processor is on a deprecation path. The arguments for both APIs are exactly the same.
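For reference, a minimal processor built this way looks roughly like the sketch below. The model name, concurrency, and batch size are illustrative, and actually running it requires a GPU plus `ray[llm]` with vLLM installed; per the footnote above, build_processor accepts the same arguments.

```python
import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

config = vLLMEngineProcessorConfig(
    model_source="unsloth/Llama-3.1-8B-Instruct",  # illustrative model
    engine_kwargs={"max_model_len": 8192},
    concurrency=1,   # number of vLLM engine replicas (Ray Data actors)
    batch_size=64,   # rows dispatched to each actor at a time
)

processor = build_llm_processor(
    config,
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["prompt"]}],
        sampling_params=dict(temperature=0.0, max_tokens=128),
    ),
    postprocess=lambda row: dict(answer=row["generated_text"]),
)

ds = ray.data.from_items([{"prompt": "What is Ray Data?"}])
ds = processor(ds)
print(ds.take_all())
```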
@RunLLM when running batch offline inference with Ray Data, seems that it produces very similar throughput numbers across many different batch sizes, whereas one would expect that increasing the batch size yields higher throughput. It seems that the batch size defined in VLLMEngineProcessorConfig is not really the effective batch size. Can you explain why? Is this related to how the dataset gets actually partitioned into blocks?
Hey @eppane, the batch size setting in vLLMEngineProcessorConfig is the size of the batch dispatched to each Ray Data actor. Each Ray Data actor maintains a task queue and runs inference with the underlying vLLM engine, which uses continuous batching (at the token level) to keep the engine saturated. Once the batch size is sufficient to not starve the Ray Data actors, increasing it won’t yield higher throughput. Here’s a relevant blog that may be helpful: Ray Data LLM enables 2x throughput over vLLM’s synchronous LLM engine at production-scale. You can also monitor your GPU utilization to confirm your job is effectively utilizing resources.
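The saturation effect can be illustrated with a toy model (not vLLM internals): an engine that decodes one token per step for a fixed number of in-flight requests, fed in dispatch batches. Once the batch size reaches the engine's capacity, larger batches no longer reduce the total step count. All numbers are made up for illustration.

```python
def steps_to_finish(batch_size, num_requests=64, tokens_per_req=8, capacity=32):
    """Toy continuous-batching model: each step the engine decodes one
    token for up to `capacity` in-flight requests. Requests arrive in
    groups of `batch_size`, refilled whenever the in-flight set drops
    below that size."""
    pending = num_requests
    inflight = []
    steps = 0
    while pending or inflight:
        # Dispatch the next batch when the in-flight set has room.
        while pending and len(inflight) < batch_size:
            inflight.append(tokens_per_req)
            pending -= 1
        # Decode one token for up to `capacity` in-flight requests.
        for i in range(min(capacity, len(inflight))):
            inflight[i] -= 1
        inflight = [t for t in inflight if t > 0]
        steps += 1
    return steps

# A batch size below the engine's capacity starves it...
print(steps_to_finish(batch_size=8))    # 64 steps
# ...but past the saturation point, bigger batches change nothing:
print(steps_to_finish(batch_size=32))   # 16 steps
print(steps_to_finish(batch_size=64))   # 16 steps
print(steps_to_finish(batch_size=128))  # 16 steps
```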