I have 15 Parquet files that I'm trying to load into Ray, but I get this warning:
The blocks of this dataset are estimated to be 15.8x larger than the target block size of 512 MiB. This may lead to out-of-memory errors during processing. Consider reducing the size of input files or using .repartition(n) to increase the number of dataset blocks.
And indeed my Ray workers die:
1 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node
My code is:
import glob
import ray
files = glob.glob('/download_folder/*')
dataset = ray.data.read_parquet(files)
I then attempted to repartition the dataset as shown below, but it made no difference:
size = dataset.size_bytes()
size_mib = size / (1024 ** 2)  # work in MiB to match the 512 MiB target block size
ideal_number_blocks = max(1, int(size_mib / 512))
logger.debug(f"repartitioning into {ideal_number_blocks} blocks")
dataset = dataset.repartition(ideal_number_blocks)
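For reference, this is the rough sanity check I can run to see whether repartitioning actually changes the block layout (assuming my Ray version still exposes Dataset.num_blocks(); size_bytes() I'm already using above):

# Rough check of block count and average block size
# (num_blocks() and size_bytes() are Ray Data Dataset methods).
def log_blocks(ds, label):
    n = ds.num_blocks()
    mib_per_block = ds.size_bytes() / max(n, 1) / (1024 ** 2)
    logger.debug(f"{label}: {n} blocks, ~{mib_per_block:.0f} MiB per block")

log_blocks(dataset, "after repartition")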
When I inspect a sample file with PyArrow, I can see it has this metadata:
num_columns: 241
num_rows: 4770085
num_row_groups: 13
serialized_size: 408988
The row groups each seem to be around 164 MB:
<pyarrow._parquet.RowGroupMetaData object at 0x7f7dc27bff40>
num_columns: 241
num_rows: 395374
total_byte_size: 164240468
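In case it helps, this is roughly how I'm reading that footer with PyArrow (the file path here is just a placeholder for one of the downloaded files):

import pyarrow.parquet as pq

# Inspect the Parquet footer of one sample file (path is a placeholder)
md = pq.ParquetFile('/download_folder/sample.parquet').metadata
print(md)                  # file-level stats: columns, rows, row groups
print(md.row_group(0))     # per-row-group stats, including total_byte_size
print(md.row_group(0).total_byte_size / (1024 ** 2), 'MiB')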
How do I fix this?
Thanks.