Ray Data does not currently provide a direct flag or argument on `map` or `map_batches` to limit the number of precomputed blocks or to throttle upstream operators based on available disk space. Backpressure is managed internally, but as you observed, it may not trigger early enough to prevent disk exhaustion, especially when upstream operators are much faster than downstream ones. This is a known limitation, and there is no documented way to set a strict queue or buffer size for operator outputs in the current Ray Data API as of version 2.52.1. See the discussion and code examples in "Ray Data: How do we set size of queue if possible?" and "Ray Data read_text calls read all of input, hogging memory and spilling".
The recommended workaround is to reduce the batch size, decrease the number of parallel actors, or restructure your pipeline so the faster stage is slowed down; there is no built-in way to cap the number of in-flight blocks or precomputed results. A sketch of these knobs is shown below. If you need more granular control, you may need to implement custom logic outside of Ray Data or file a feature request, as this is an active area of development and a common pain point for users with imbalanced pipelines. Would you like more detail on possible workarounds or internal Ray Data backpressure mechanisms?
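As a rough illustration of those knobs (not a true per-operator queue limit), here is a minimal sketch assuming a fast read stage feeding a slow transform. The input path, memory limit, batch size, and concurrency values are hypothetical; the object-store memory cap only throttles the pipeline indirectly by limiting how much buffered data the streaming executor will hold:

```python
import ray
import ray.data

ray.init()

# Optionally cap how much object store memory the streaming executor may use.
# This indirectly slows a fast upstream stage once the budget is reached
# (the 2 GiB value here is purely illustrative).
ctx = ray.data.DataContext.get_current()
ctx.execution_options.resource_limits.object_store_memory = 2 * 1024**3


class SlowTransform:
    """Placeholder for the expensive downstream stage."""

    def __call__(self, batch):
        # ... do the slow work on the batch ...
        return batch


ds = (
    ray.data.read_text("s3://my-bucket/inputs/")  # hypothetical input path
    .map_batches(
        SlowTransform,
        batch_size=256,   # smaller batches -> smaller in-flight blocks
        concurrency=2,    # fewer parallel actors for the slow stage
    )
)

# Consume the dataset in a streaming fashion rather than materializing it.
for _ in ds.iter_batches(batch_size=256):
    pass
```

This only shapes memory pressure and parallelism; it does not give you a hard bound on the number of precomputed blocks sitting on disk.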
Sources:
- Ray Data: How do we set size of queue if possible?
- Ray Data read_text calls read all of input, hogging memory and spilling