How can I limit the number of blocks Ray Data precomputes?

1. Severity of the issue: (select one)
Medium: Significantly affects my productivity but can find a workaround.

2. Environment:

  • Ray version: 2.52.1
  • Python version: 3.10.19
  • OS: linux/ubuntu24.04
  • Cloud/Infrastructure: bare metal
  • Other libs/tools (if relevant):

3. What happened vs. what you expected:

When running a Ray Data pipeline that reads videos, I constantly run out of disk space. The problem is that my map_batches inference stage is much slower than my map_batches preprocessing stage, so Ray Data keeps precomputing preprocessed frames and spilling them until the disk is full. The point at which it reports being backpressured and slows or stops that stage comes far too late for me.

  • Expected:

    • Ideally, Ray should be aware of the disk space available for spilling and throttle the parts of the pipeline that are effectively backpressured.
    • For my use case, it would be enough to have a flag or argument I can pass to map, map_batches, or a Dataset that limits how much precomputation happens.
  • Actual:

    • Ray fills up the disk and eventually crashes with "No space left on device".

This is roughly what my pipeline looks like:

dataset = ray.data.read_videos(str(path), include_paths=True).map(add_label_to_row(label))

preprocessed_dataset = dataset.map_batches(
        Preprocessor,
        compute=ray.data.ActorPoolStrategy(size=3),
        num_gpus=0,
        batch_size=100,
    ) # I would like to tell ray here to precompute only e.g. 10 blocks so it would not even have to spill to disk
    
predictions = preprocessed_dataset.map_batches(
        InferenceModel,
        compute=ray.data.ActorPoolStrategy(size=1),
        num_gpus=1,
        batch_size=100,
    )

predictions.write_parquet(...)

Ray Data does not currently provide a flag or argument on map or map_batches to limit the number of precomputed blocks, nor a way to throttle upstream operators based on available disk space. Backpressure is managed internally, but as you observed, it may not trigger early enough to prevent disk exhaustion when an upstream operator is much faster than the downstream one. This is a known limitation: as of version 2.52.1 there is no documented way to set a strict queue or buffer size for operator outputs in the Ray Data API. See the discussions "Ray Data: How do we set size of queue if possible?" and "Ray Data read_text calls read all of input, hogging memory and spilling" for related examples.
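
Not a strict per-operator cap, but one internal knob worth experimenting with is the memory budget of the streaming executor, which can be lowered via the DataContext so upstream work is throttled sooner. The sketch below is only an illustration under the assumption that the Ray 2.x attribute path execution_options.resource_limits.object_store_memory behaves this way in your version; verify it against your install before relying on it.

import ray

# Hedged sketch: shrink the object store memory budget the Ray Data streaming
# executor may use for in-flight blocks, so the fast preprocessing stage is
# backpressured earlier instead of spilling to disk until it is full.
# Attribute names follow the Ray 2.x DataContext API; the 2 GiB value is an
# arbitrary example, not a recommendation.
ctx = ray.data.DataContext.get_current()
ctx.execution_options.resource_limits.object_store_memory = 2 * 1024**3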

The recommended workarounds are to reduce the batch size, decrease the number of parallel preprocessing actors, or restructure the pipeline so the faster stage cannot run arbitrarily far ahead; there is no built-in way to cap the number of in-flight blocks or precomputed results. If you need more granular control, you may need to implement custom logic outside of Ray Data (one chunked approach is sketched below) or file a feature request, as this is an active area of development and a common pain point for imbalanced pipelines. Would you like more detail on possible workarounds or on Ray Data's internal backpressure mechanisms?
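
As a concrete example of custom logic outside of Ray Data, you can bound precomputation yourself by splitting the input into chunks and running the full pipeline chunk by chunk, so at most one chunk's worth of preprocessed frames can be materialized or spilled at any time. The sketch below builds on the pipeline above; video_paths, label, Preprocessor, InferenceModel, add_label_to_row, and the output path are placeholders, and it assumes read_videos accepts a list of paths like other path-based readers.

import ray

CHUNK_SIZE = 50  # videos per chunk; tune to your disk and memory budget

for i in range(0, len(video_paths), CHUNK_SIZE):
    chunk = video_paths[i : i + CHUNK_SIZE]
    ds = ray.data.read_videos(chunk, include_paths=True).map(add_label_to_row(label))
    preprocessed = ds.map_batches(
        Preprocessor,
        compute=ray.data.ActorPoolStrategy(size=3),
        num_gpus=0,
        batch_size=100,
    )
    predictions = preprocessed.map_batches(
        InferenceModel,
        compute=ray.data.ActorPoolStrategy(size=1),
        num_gpus=1,
        batch_size=100,
    )
    # One output directory per chunk so parquet files do not collide.
    predictions.write_parquet(f"predictions/chunk_{i // CHUNK_SIZE:05d}")

The trade-off is that the GPU idles while each new chunk is read and preprocessed, and the actor pools are recreated per chunk, so pick a chunk size large enough to amortize that startup cost but small enough to fit your disk budget.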

Hint: Mention @RunLLM in the post for followups.