Cannot read Parquet files

I have 15 Parquet files I’m trying to load into Ray, but I’m getting this error:

The blocks of this dataset are estimated to be 15.8x larger than the target block size of 512 MiB. This may lead to out-of-memory errors during processing. Consider reducing the size of input files or using .repartition(n) to increase the number of dataset blocks.
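If I'm reading that warning right, that's about 15.8 × 512 MiB ≈ 7.9 GiB estimated per block.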

And indeed my Ray workers die:
1 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node

My code is:

import glob
import ray

files = glob.glob('/download_folder/*')
dataset = ray.data.read_parquet(files)

I then attempted to repartition the data with the code below, but it made no difference:

# Estimate the dataset's in-memory size and split it into ~512 MB blocks.
size = dataset.size_bytes()
size_mb = size / 1e6
ideal_number_blocks = int(size_mb / 512)
logger.debug(f"repartitioning into {ideal_number_blocks} blocks")
dataset = dataset.repartition(ideal_number_blocks)
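In case it helps, this is the calculation I was aiming for, written out more defensively (the variable names are mine; the 512 MiB target is taken from the warning above):

import math

TARGET_BLOCK_SIZE = 512 * 1024 * 1024  # 512 MiB, matching the target in the warning

size = dataset.size_bytes()  # Ray's estimate of the in-memory dataset size
# Round up, and never ask for zero blocks if the dataset is small.
ideal_number_blocks = max(1, math.ceil(size / TARGET_BLOCK_SIZE))
dataset = dataset.repartition(ideal_number_blocks)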

When I inspect a sample file with PyArrow, its metadata looks like this:
num_columns: 241
num_rows: 4770085
num_row_groups: 13
serialized_size: 408988

The row groups each seem to be around 164 MB:
<pyarrow._parquet.RowGroupMetaData object at 0x7f7dc27bff40>
num_columns: 241
num_rows: 395374
total_byte_size: 164240468
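
This is roughly how I pulled those numbers (the path is just a placeholder for one of my files):

import pyarrow.parquet as pq

# Inspect file-level and row-group metadata without reading the data itself.
metadata = pq.ParquetFile('/download_folder/sample.parquet').metadata
print(metadata)               # num_columns, num_rows, num_row_groups, ...
print(metadata.row_group(0))  # per-row-group stats, including total_byte_size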

How do I fix this?

Thanks.

Hi @skunkwerk, which Ray version are you using? How many nodes are in your Ray cluster, and what do their specs look like?

Note that Parquet files are heavily compressed and can be several times larger once decompressed and loaded into Ray.
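
You can get a rough sense of that ratio from the file's own metadata, something along these lines (the path is a placeholder, and the in-memory Arrow size can still be larger than this uncompressed estimate):

import os
import pyarrow.parquet as pq

path = '/download_folder/sample.parquet'  # placeholder path
metadata = pq.ParquetFile(path).metadata

on_disk = os.path.getsize(path)
# Sum of the uncompressed column data across all row groups.
uncompressed = sum(metadata.row_group(i).total_byte_size
                   for i in range(metadata.num_row_groups))

print(f"on disk: {on_disk / 1e6:.0f} MB, "
      f"uncompressed: {uncompressed / 1e6:.0f} MB, "
      f"ratio: {uncompressed / on_disk:.1f}x")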

I’m using Ray 2.3.1 on a single local node with 64 GB of RAM and 4 CPUs.

Thanks.