Distribute computation

Hi all,

I’m build a data processing pipeline using Ray Datasets and I have some question. In the begining of the pipeline I load jsonlines that possibly are unbalanced and then I use a flat_map to process them. Does the flat_map distribute the load between the blocks or do I need to call repartition to recompute the blocks?

Hi @ssamdav If I understand correctly, you are reading a bunch of JSON files, which have unbalanced sizes. Do you know how large are those files? Have you checked if the blocks of the Dataset is really highly unbalanced?

The Datasets has a feature called dynamic block splitting, which will create multiple blocks from the same file if it’s large. By default this splitting will happen if the block size is larger than 512MB (ray/context.py at master · ray-project/ray · GitHub), and it can be set to a different value based on use case. So the block splitting should make sure blocks are not highly skewed in sizes.

@ssamdav does @jianxiao response answer your question?

1 Like

Yes it does, thanks!

Excellen, and thanks for filling the issue