Ray datasets streaming block split?

How severe does this issue affect your experience of using Ray?

  • Low: It annoys or frustrates me for a moment.

Hi! I’m interested using ray for a data pipeline task that involves several somewhat long running steps and possibly indefinite stream of data. I think there are several options for the design, but using datasets seems to have an advantage in simplicity.

Because both long stream and long-running tasks, it is necessary to process to the end of the pipeline and write results regularly. Prior to 2.5.0, datasetpipelines were a way to accomplish this, and from this point on the streaming feature of datasets works similarly.

However, there is one sticking point I am wondering about: I can use batch_size > block_size of map_batches to gather up results which will effectively merge blocks and decrease parallelism (the task has overheads which make it more efficient per row given larger batches), but it doesn’t appear that I can split them again to increase parallelism in subsequent stages without repartition, which would require to materialize everything which would effectively disable the streaming functionality. datasetpipelines had a repartition_each_window which did something along these lines.

My questions are these: Is there an existing method to split blocks between map_batches stages in a streaming fashion? If not, would such a feature violate the design principals or is it worth considering as a feature? Any other comments or things to consider?

Many thanks!

Hey @Arik_Mitschang,

Streaming repartitions aren’t supported yet, but it’s a planned feature: [data] [streaming] Support a streaming_repartition() operator · Issue #36724 · ray-project/ray · GitHub