1. Severity of the issue: (select one)
[ ] None: I’m just curious or want clarification.
[ ] Low: Annoying but doesn’t hinder my work.
[ ] Medium: Significantly affects my productivity but I can find a workaround.
[x] High: Completely blocks me.
2. Environment:
- Ray version: 2.36
- Python version: 3.9
- OS: AL2023
- Cloud/Infrastructure: AWS
- Other libs/tools (if relevant):
3. What happened vs. what you expected:
- Expected: Ray Data read_text receives backpressure from the slower downstream stages, slows its reads accordingly, and does not materialize the whole dataset.
- Actual: Ray Data read_text reads all of the data without waiting for the subsequent stages to consume it. The entire dataset gets spilled, and the Ray SpillWorker then holds a huge chunk of memory indefinitely, even after all of the code that created the Ray Data pipeline goes out of scope.
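For reference, a minimal repro sketch of the pipeline shape described above. The bucket path, batch size, and sleep duration are hypothetical placeholders (not from my actual workload); any large text dataset with a slow downstream stage should show the same eager-read-and-spill behavior:

```python
import time


def slow_stage(row):
    """Simulate a downstream stage that is much slower than the read."""
    time.sleep(0.1)  # placeholder delay; real stage does heavier work
    return row


if __name__ == "__main__":
    import ray  # Ray 2.36, per the environment section above

    ray.init()

    # Hypothetical path: any dataset much larger than object store memory.
    ds = ray.data.read_text("s3://my-bucket/large-text-dataset/")

    # Expectation: read_text is throttled by backpressure from slow_stage.
    # Observed: read_text reads everything eagerly, the data is spilled,
    # and the SpillWorker keeps holding memory after the loop finishes
    # and ds goes out of scope.
    for _ in ds.map(slow_stage).iter_batches(batch_size=64):
        pass
```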