Just two stages present no matter how many stages defined for DatasetPipeline

mark199342 · October 28, 2022, 4:59am

How severely does this issue affect your experience of using Ray?

High: It blocks me from completing my task.

Hi,

I’m trying to use DatasetPipeline to parallelize data processing and inference in a pipelined fashion. However, no matter how many stages I define in my pipeline, only two stages show progress and run in parallel.

I also followed the first example from the "Pipelining Compute" page of the Ray 2.0.1 docs, but it still showed only two stages.

Is there anything I missed?

jianxiao · October 28, 2022, 5:01pm

Hi @mark199342, welcome to the Ray community!

This is actually expected. In Ray DatasetPipeline, the operations are executed lazily, which means a .map() call will not be executed right away. The pipeline is executed only when you start consuming data with calls like .iter_rows(). At that point, Datasets will perform an optimization called stage fusion to improve the execution performance and memory efficiency. In your case, the multiple .map() calls will be fused into a single stage. You may check some more details about stage fusion here: Scheduling, Execution, and Memory Management — Ray 2.0.1
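To make the idea concrete, here is a minimal sketch in plain Python of lazy execution plus fusion of consecutive map calls. It is only a toy illustration, not Ray's implementation: the class `ToyLazyPipeline`, its `num_defined_stages()` method, and the helper functions are hypothetical names invented for this example, and only the method names `.map()` and `.iter_rows()` mirror the calls mentioned above.

```python
from typing import Any, Callable, Iterable, Iterator, List


class ToyLazyPipeline:
    """Toy stand-in for a lazy pipeline. This is NOT Ray's DatasetPipeline."""

    def __init__(self, source: Iterable[Any]) -> None:
        self._source = source
        self._stages: List[Callable[[Any], Any]] = []

    def map(self, fn: Callable[[Any], Any]) -> "ToyLazyPipeline":
        # Lazy: only record the function. Nothing is executed here.
        self._stages.append(fn)
        return self

    def num_defined_stages(self) -> int:
        # Number of stages the user wrote, before any fusion.
        return len(self._stages)

    def _fuse(self) -> Callable[[Any], Any]:
        # Fusion: collapse all recorded map functions into one callable,
        # so each row passes through every function in a single stage.
        stages = list(self._stages)

        def fused(row: Any) -> Any:
            for fn in stages:
                row = fn(row)
            return row

        return fused

    def iter_rows(self) -> Iterator[Any]:
        # Execution starts only when the caller consumes rows.
        fused = self._fuse()
        for row in self._source:
            yield fused(row)


calls: List[str] = []  # records which function ran, in order


def add_one(x: int) -> int:
    calls.append("add_one")
    return x + 1


def double(x: int) -> int:
    calls.append("double")
    return x * 2


def square(x: int) -> int:
    calls.append("square")
    return x * x


pipe = ToyLazyPipeline(range(3)).map(add_one).map(double).map(square)
calls_before_consumption = list(calls)  # still empty: nothing has run yet
rows = list(pipe.iter_rows())           # consuming rows triggers execution
```

In this toy, three stages are defined but they run as one fused function per row, which is analogous to why a progress display can show fewer stages than were written. Ray's actual fusion rules are described in the docs page linked above and may differ from this simplification.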

mark199342 · October 28, 2022, 7:51pm

Thanks for the clarification! It would help to update the doc to avoid confusion, since that page is very likely the first one people read when onboarding.

Thanks!

jianxiao · October 28, 2022, 9:27pm

Thanks for the feedback.
How about we display better information when you call print(pipe)? For example, instead of DatasetPipeline(num_windows=20, num_stages=5) (the one in your screenshot), we would display DatasetPipeline(num_windows=20, num_stages=5, num_fused_stages=2). Would that make it clearer?

mark199342 · October 28, 2022, 9:48pm

This would be better.
