Distribute computation

ssamdav · April 5, 2023, 1:00pm

Hi all,

I’m building a data processing pipeline using Ray Datasets and I have a question. At the beginning of the pipeline I load JSON Lines files that may be unbalanced in size, and then I use flat_map to process them. Does flat_map distribute the load across the blocks, or do I need to call repartition to recompute the blocks?

jianxiao · April 7, 2023, 4:32pm

Hi @ssamdav, if I understand correctly, you are reading a bunch of JSON files of unbalanced sizes. Do you know how large those files are? Have you checked whether the blocks of the Dataset are really highly unbalanced?

Datasets has a feature called dynamic block splitting, which creates multiple blocks from a single file if the file is large. By default, splitting happens when a block would be larger than 512MB (ray/context.py at master · ray-project/ray · GitHub), and the threshold can be set to a different value depending on your use case. So block splitting should keep the blocks from being highly skewed in size.
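To make the effect concrete, here is a toy model in plain Python. It is not Ray's implementation, only a sketch under the assumption that any file above the threshold is cut into near-equal blocks; the function name `split_into_blocks` and the example file sizes are made up for illustration. In Ray itself, I believe the threshold is the `target_max_block_size` setting on the data context, but check that against your Ray version.

```python
import math


def split_into_blocks(file_sizes_mb, max_block_mb=512):
    """Toy model of block splitting (hypothetical helper, not a Ray API).

    Each file becomes one block unless it exceeds max_block_mb, in which
    case it is cut into the fewest near-equal blocks that fit the limit.
    """
    blocks = []
    for size in file_sizes_mb:
        n_parts = max(1, math.ceil(size / max_block_mb))
        blocks.extend([size / n_parts] * n_parts)
    return blocks


# Made-up, unbalanced input: two small files and one very large one.
file_sizes_mb = [100, 300, 2000]
blocks = split_into_blocks(file_sizes_mb)

print("blocks (MB):", blocks)
print("largest block (MB):", max(blocks))
```

In this model the 2000MB file no longer dominates a single block, but the small files stay small. Splitting only caps the largest block; if you also want the small blocks evened out, that is where calling repartition would come in.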

Jules_Damji · April 7, 2023, 11:24pm

@ssamdav does @jianxiao’s response answer your question?

ssamdav · April 12, 2023, 1:01pm

Yes it does, thanks!

Jules_Damji · April 12, 2023, 1:32pm

Excellent, and thanks for filing the issue.
