Ray Data read_text reads all of the input, hogging memory and spilling

1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
[X] High: Completely blocks me.

2. Environment:

  • Ray version: 2.36
  • Python version: 3.9
  • OS: AL2023
  • Cloud/Infrastructure: AWS
  • Other libs/tools (if relevant):

3. What happened vs. what you expected:

  • Expected: Ray Data read_text receives backpressure from the slower downstream stages, slows down, and does not materialize the whole dataset.
  • Actual: Ray Data read_text reads all of the data without waiting for the subsequent stages to consume it; all of the data gets spilled, and the Ray SpillWorker then holds a huge chunk of memory forever, even after all of the code that created the Ray Data pipeline goes out of scope.

What errors specifically are you seeing from the Ray SpillWorker? Is it an out-of-memory error? Do you have any logs we can take a look at?