Ray datasets streaming block split?

Arik_Mitschang · June 15, 2023, 10:50pm

How severe does this issue affect your experience of using Ray?

Low: It annoys or frustrates me for a moment.

Hi! I’m interested in using Ray for a data pipeline task that involves several somewhat long-running steps and a possibly indefinite stream of data. I think there are several options for the design, but using Datasets seems to have an advantage in simplicity.

Because the stream is long and the tasks are long-running, it is necessary to process data through to the end of the pipeline and write results regularly. Prior to 2.5.0, DatasetPipelines were a way to accomplish this, and from 2.5.0 on the streaming feature of Datasets works similarly.

However, there is one sticking point I am wondering about. In map_batches I can set batch_size larger than the block size to gather up results, which effectively merges blocks and decreases parallelism (the task has overheads that make it more efficient per row with larger batches). But it doesn’t appear that I can split the blocks again to increase parallelism in subsequent stages without calling repartition, which would require materializing everything and would effectively disable streaming. DatasetPipelines had a repartition_each_window method, which did something along these lines.
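To make the merge-then-split idea concrete, here is a minimal sketch in plain Python. It is not the Ray Data API: `Block`, `merge_blocks`, `split_blocks`, and `source` are hypothetical names used only for illustration, and generators stand in for streaming stages. It shows the behavior being asked for: blocks are merged into large batches for one stage and then split back into small blocks for the next, without ever materializing the whole stream.

```python
from itertools import islice
from typing import Iterable, Iterator, List

Block = List[int]  # hypothetical stand-in for a block of rows


def merge_blocks(blocks: Iterable[Block], batch_size: int) -> Iterator[Block]:
    """Gather rows from small incoming blocks into batches of batch_size rows."""
    buffer: Block = []
    for block in blocks:
        buffer.extend(block)
        # Emit full batches as soon as enough rows have accumulated.
        while len(buffer) >= batch_size:
            yield buffer[:batch_size]
            buffer = buffer[batch_size:]
    if buffer:
        # Flush the final partial batch when a finite stream ends.
        yield buffer


def split_blocks(blocks: Iterable[Block], rows_per_block: int) -> Iterator[Block]:
    """Split each incoming block into smaller blocks as soon as it arrives."""
    for block in blocks:
        for start in range(0, len(block), rows_per_block):
            yield block[start:start + rows_per_block]


def source() -> Iterator[Block]:
    """A hypothetical endless stream of 2-row blocks."""
    n = 0
    while True:
        yield [n, n + 1]
        n += 2


# Merge 2-row blocks into 8-row batches, then split back into 2-row blocks.
merged = merge_blocks(source(), batch_size=8)
resplit = split_blocks(merged, rows_per_block=2)

# Only as much of the endless stream is consumed as is needed.
first_five = list(islice(resplit, 5))
print(first_five)
```

Because every stage is a generator, the fifth output block only requires the second merged batch to be produced, which is the property a materializing repartition would lose.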

My questions are these: Is there an existing method to split blocks between map_batches stages in a streaming fashion? If not, would such a feature violate the design principles, or is it worth considering as a feature? Are there any other comments or things to consider?

Many thanks!

bveeramani · June 27, 2023, 12:29am

Hey @Arik_Mitschang,

Streaming repartitions aren’t supported yet, but it’s a planned feature: "[data] [streaming] Support a streaming_repartition() operator" (Issue #36724, ray-project/ray on GitHub)
