1. Severity of the issue: (select one)
Medium: Significantly affects my productivity but can find a workaround.
2. Environment:
- Ray version: 2.52.1
- Python version: 3.10.19
- OS: Linux (Ubuntu 24.04)
- Cloud/Infrastructure: bare metal
- Other libs/tools (if relevant):
3. What happened vs. what you expected:
When creating a Ray Data pipeline that reads videos, I constantly run out of disk space. The problem is that my map_batches inference step is much slower than my map_batches preprocessing step, so Ray Data just keeps precomputing preprocessed frames until the disk is full. The point at which it reports backpressure and slows down or stops that step comes far too late for me.
Expected:
- Ideally, Ray should be aware of the disk space available for spilling and throttle the parts of the pipeline that are basically backpressured.
- For me, it would be enough if there were a flag or argument I could pass to a `map`, a `map_batches`, or a `Dataset` that limits the amount of precomputation happening (see the sketch after this list).
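Something along these lines would already solve it for me. Note that `max_precomputed_blocks` below is purely hypothetical and not an existing Ray argument; it only illustrates the kind of knob I have in mind:

```python
# Hypothetical argument (does NOT exist in Ray today), only to illustrate the request:
preprocessed_dataset = dataset.map_batches(
    Preprocessor,
    compute=ray.data.ActorPoolStrategy(size=3),
    batch_size=100,
    max_precomputed_blocks=10,  # stop scheduling this operator once 10 output blocks are queued downstream
)
```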
Actual:
- Ray fills up the disk and crashes at some point due to `No space left on device`.
This is roughly what my pipeline looks like:
```python
import ray

# `add_label_to_row`, `Preprocessor`, `InferenceModel`, `label`, and `path`
# are defined elsewhere in my code.
dataset = ray.data.read_videos(str(path), include_paths=True).map(add_label_to_row(label))

preprocessed_dataset = dataset.map_batches(
    Preprocessor,
    compute=ray.data.ActorPoolStrategy(size=3),
    num_gpus=0,
    batch_size=100,
)  # I would like to tell Ray here to precompute only e.g. 10 blocks, so it would not even have to spill to disk

predictions = preprocessed_dataset.map_batches(
    InferenceModel,
    compute=ray.data.ActorPoolStrategy(size=1),
    num_gpus=1,
    batch_size=100,
)

predictions.write_parquet(...)
```
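For reference, the closest thing I have found (assuming I am reading the Ray Data docs correctly) is the global object store cap on the `DataContext`, but that limits memory for the whole pipeline rather than how far a single operator is allowed to run ahead of the slow inference step:

```python
import ray

# Assumption on my side: this caps Ray Data's total object store usage,
# but it is not a per-operator limit on precomputation.
ctx = ray.data.DataContext.get_current()
ctx.execution_options.resource_limits.object_store_memory = 10 * 1024**3  # 10 GiB
```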