Only two stages shown no matter how many stages are defined for DatasetPipeline

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.


I’m trying to use DatasetPipeline to parallelize data processing and inference in a pipelined fashion. However, no matter how many stages my pipeline defines, only two stages show progress and run in parallel.

I also followed the first example from here: Pipelining Compute — Ray 2.0.1, but it still showed only two stages.

Is there anything I missed?

Hi @mark199342, welcome to the Ray community!

This is actually expected. In Ray DatasetPipeline, operations are executed lazily, which means a .map() call is not executed right away. The pipeline is executed only when you start consuming data with calls like .iter_rows(). At that point, Datasets performs an optimization called stage fusion to improve execution performance and memory efficiency. In your case, the multiple .map() calls are fused into a single stage. You can find more details about stage fusion here: Scheduling, Execution, and Memory Management — Ray 2.0.1
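
To make the idea concrete, here is a minimal pure-Python sketch of lazy execution plus fusion of consecutive map calls. It is only an illustration under stated assumptions: ToyPipeline, its num_stages property, and its _fuse helper are hypothetical names made up for this post, and this is not Ray's API or how Ray implements stage fusion internally.

```python
from typing import Callable, Iterable, Iterator, List, Optional


class ToyPipeline:
    """Hypothetical stand-in for a lazy pipeline; NOT Ray's DatasetPipeline."""

    def __init__(
        self,
        rows: Iterable[int],
        stages: Optional[List[Callable[[int], int]]] = None,
    ) -> None:
        self._rows = rows
        self._stages: List[Callable[[int], int]] = list(stages or [])

    @property
    def num_stages(self) -> int:
        # Number of map stages the user defined (before any fusion).
        return len(self._stages)

    def map(self, fn: Callable[[int], int]) -> "ToyPipeline":
        # Lazy: only record the stage; nothing is executed here.
        return ToyPipeline(self._rows, self._stages + [fn])

    def _fuse(self) -> Callable[[int], int]:
        # Compose every recorded map function into one callable,
        # so all of them run as a single pass over each row.
        stages = list(self._stages)

        def fused(row: int) -> int:
            for fn in stages:
                row = fn(row)
            return row

        return fused

    def iter_rows(self) -> Iterator[int]:
        # Execution starts only when rows are consumed.
        fused = self._fuse()
        for row in self._rows:
            yield fused(row)


# Three map stages are defined, but they execute as one fused function.
pipe = (
    ToyPipeline(range(3))
    .map(lambda x: x + 1)
    .map(lambda x: x * 2)
    .map(lambda x: x - 1)
)
print(pipe.num_stages)         # 3
print(list(pipe.iter_rows()))  # [1, 3, 5]
```

In this toy version, the three .map() calls are counted as three defined stages, yet consuming the rows runs them as one composed function, which is roughly why fewer stages show up at execution time than were defined.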

Thanks for the clarification! It would be better to update the doc to avoid confusion, since that page is very likely the first one people read when onboarding.


Thanks for the feedback.
How about we display better information when you call print(pipe)? For example, instead of DatasetPipeline(num_windows=20, num_stages=5) (the one in your screenshot), we could display DatasetPipeline(num_windows=20, num_stages=5, num_fused_stages=2). Would that help make it clearer?

This would be better.