Ray Data read_text reads all of the input, hogging memory and spilling

1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
[X] High: Completely blocks me.

2. Environment:

  • Ray version: 2.36
  • Python version: 3.9
  • OS: AL2023
  • Cloud/Infrastructure: AWS
  • Other libs/tools (if relevant):

3. What happened vs. what you expected:

  • Expected: Ray Data read_text receives backpressure from the slower downstream stages, slows down, and does not materialize the whole dataset.
  • Actual: Ray Data read_text reads all of the data without waiting for the subsequent stages to consume it; all of the data gets spilled, and the Ray SpillWorker then holds a huge chunk of memory forever, even after all of the code that created the Ray Data pipeline goes out of scope.

What errors specifically are you seeing from the Ray SpillWorker? Is it an out-of-memory error? Do you have any logs we can take a look at?