Ray Data does not currently provide a direct flag or argument on `map` or `map_batches` to limit the number of precomputed blocks or to throttle upstream operators based on available disk space. Backpressure is managed internally, but as you observed, it may not trigger early enough to prevent disk exhaustion, especially when upstream operators are much faster than downstream ones. This is a known limitation, and there is no documented way to set a strict queue or buffer size for operator outputs in the current Ray Data API as of version 2.52.1. See the discussion and code examples in "Ray Data: How do we set size of queue if possible?" and "Ray Data read_text calls read all of input, hogging memory and spilling".
The recommended workaround is to reduce the batch size, decrease the number of parallel actors, or restructure your pipeline so the faster stage is slowed down; there is no built-in way to cap the number of in-flight blocks or precomputed results. A sketch of these knobs is shown below. If you need more granular control, you may need to implement custom logic outside of Ray Data or file a feature request, as this is an active area of development and a common pain point for users with imbalanced pipelines. Would you like more detail on possible workarounds or internal Ray Data backpressure mechanisms?
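As a rough illustration of those knobs (not a true per-operator queue limit), here is a minimal sketch assuming a fast read stage feeding a slow transform. The input path, memory limit, batch size, and concurrency values are hypothetical; the object-store memory cap only throttles the pipeline indirectly by limiting how much buffered data the streaming executor will hold:

```python
import ray
import ray.data

ray.init()

# Optionally cap how much object store memory the streaming executor may use.
# This indirectly slows a fast upstream stage once the budget is reached
# (the 2 GiB value here is purely illustrative).
ctx = ray.data.DataContext.get_current()
ctx.execution_options.resource_limits.object_store_memory = 2 * 1024**3


class SlowTransform:
    """Placeholder for the expensive downstream stage."""

    def __call__(self, batch):
        # ... do the slow work on the batch ...
        return batch


ds = (
    ray.data.read_text("s3://my-bucket/inputs/")  # hypothetical input path
    .map_batches(
        SlowTransform,
        batch_size=256,   # smaller batches -> smaller in-flight blocks
        concurrency=2,    # fewer parallel actors for the slow stage
    )
)

# Consume the dataset in a streaming fashion rather than materializing it.
for _ in ds.iter_batches(batch_size=256):
    pass
```

This only shapes memory pressure and parallelism; it does not give you a hard bound on the number of precomputed blocks sitting on disk.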
Sources:
- Ray Data: How do we set size of queue if possible?
- Ray Data read_text calls read all of input, hogging memory and spilling