What is the downside of using small blocks instead of large blocks?

Hello, I am using Ray Dataset's `zip` to mock a join on the unique key of two datasets.
For example, I read the two datasets with code from different users, each with its own business logic. The two datasets can have the same number of rows, but I can't guarantee they have the same number of blocks: one dataset is read from a file store partitioned monthly, while the other is stored daily, for example.
They also share the same set of primary keys, so I can sort both datasets and then zip them to achieve the join.
But I found that the dataset's `zip` method requires the two datasets to have not only the same number of rows but also the same number of blocks.
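Here is a minimal sketch of what I mean, with toy in-memory data standing in for the real readers (the key values, column names, and block counts are made up):

```python
import ray

# Toy stand-ins for the two readers: same rows and keys,
# but different block layouts (monthly vs. daily files).
ds_monthly = ray.data.from_items(
    [{"key": i, "a": i} for i in range(100)]
).repartition(4)    # e.g. 4 monthly files -> 4 blocks
ds_daily = ray.data.from_items(
    [{"key": i, "b": i} for i in range(100)]
).repartition(30)   # e.g. 30 daily files -> 30 blocks

# Sort both sides on the shared primary key, then zip to mock the join.
# On my version this fails because the block counts (4 vs. 30) differ,
# even though the row counts match.
joined = ds_monthly.sort("key").zip(ds_daily.sort("key"))
```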
I plan to use `ray.data.Dataset.repartition` to make sure the two datasets have the same number of blocks (the row counts already match), but I am not sure about the downside of this.
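Something like this is what I have in mind (the target block count is arbitrary):

```python
# Repartition both sides to a common block count before zipping.
# repartition() uses the default shuffle=False here, so the sort
# order should be preserved.
num_blocks = 30  # arbitrary common target
joined = (
    ds_monthly.sort("key").repartition(num_blocks)
    .zip(ds_daily.sort("key").repartition(num_blocks))
)
```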

Or should I just make sure the reader of each dataset produces one block per row?
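That is, something like this, using `repartition(count)` as a stand-in for configuring the readers to emit one-row blocks:

```python
# Force one row per block on both sides so the block counts
# always match -- at the cost of many tiny blocks, which is
# what my title question is about.
n = ds_monthly.count()
joined = (
    ds_monthly.sort("key").repartition(n)
    .zip(ds_daily.sort("key").repartition(n))
)
```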