Yes, if you want absolute control over batch composition and ordering, you can manually create a Ray Dataset for each batch and run the processor on it, as in your pseudo-code. This approach ensures each batch is processed exactly as you define it. The trade-off is that each batch runs as its own small dataset, so you give up Ray Data's internal batching and its pipelining and parallelism across batches, which may make it less efficient for large-scale workloads (Ray Data docs).
This method is valid for scenarios where strict batch boundaries or custom batch logic are required, but for most use cases, leveraging Ray Data’s built-in batching and parallelism is recommended for performance and scalability.
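For reference, the manual approach could look roughly like the sketch below. The helper names `explicit_batches` and `run_batches_sequentially` are made up for illustration, and `processor` is assumed to be any callable that takes a Ray Dataset and returns one (for example, a processor built with `ray.data.llm`). Adapt it to your actual pseudo-code.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List, Sequence

Row = Dict[str, Any]


def explicit_batches(rows: Sequence[Row], batch_size: int) -> Iterator[List[Row]]:
    """Yield consecutive fixed-size slices, so batch membership and order
    are decided entirely by the caller (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(rows), batch_size):
        yield list(rows[start:start + batch_size])


def run_batches_sequentially(
    batches: Iterable[List[Row]],
    processor: Callable[[Any], Any],
) -> List[List[Row]]:
    """Build one small Ray Dataset per batch and run the processor on it.

    Assumption: `processor` maps a Ray Dataset to a Ray Dataset.
    Batches run one after another, so there is no overlap between them.
    """
    import ray  # imported lazily so the batching helper works without Ray

    results = []
    for batch in batches:
        ds = ray.data.from_items(batch)        # one dataset per batch
        results.append(processor(ds).take_all())  # materialize this batch's output
    return results
```

Usage would be something like `run_batches_sequentially(explicit_batches(rows, 32), processor)`. Because every batch is materialized before the next one starts, this is simple to reason about but will typically be slower than letting Ray Data schedule the work itself.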
Sources:
- https://discuss.ray.io/t/datasets-create-custom-dataset-by-grouping-merging-existing-blocks/8176
- https://docs.anyscale.com/llm/batch-inference/resource-allocation/concurrency-and-batching.md