Cannot read Parquet files

I have 15 Parquet files I’m trying to load into Ray, but I’m getting this error:

The blocks of this dataset are estimated to be 15.8x larger than the target block size of 512 MiB. This may lead to out-of-memory errors during processing. Consider reducing the size of input files or using .repartition(n) to increase the number of dataset blocks.
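If I'm reading that warning right, that's about 15.8 × 512 MiB ≈ 7.9 GiB estimated per block.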

And indeed my Ray workers die:
1 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node

My code is:

import glob
import ray

files = glob.glob('/download_folder/*')
dataset = ray.data.read_parquet(files)

I then attempted to repartition the data with the code below, but it made no difference:

# Estimate the dataset's in-memory size and split it into ~512 MB blocks.
size = dataset.size_bytes()
size_mb = size / 1e6
ideal_number_blocks = int(size_mb / 512)
logger.debug(f"repartitioning into {ideal_number_blocks} blocks")
dataset = dataset.repartition(ideal_number_blocks)
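In case it helps, this is the calculation I was aiming for, written out more defensively (the variable names are mine; the 512 MiB target is taken from the warning above):

import math

TARGET_BLOCK_SIZE = 512 * 1024 * 1024  # 512 MiB, matching the target in the warning

size = dataset.size_bytes()  # Ray's estimate of the in-memory dataset size
# Round up, and never ask for zero blocks if the dataset is small.
ideal_number_blocks = max(1, math.ceil(size / TARGET_BLOCK_SIZE))
dataset = dataset.repartition(ideal_number_blocks)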

When I inspect a sample file with PyArrow, its metadata looks like this:
num_columns: 241
num_rows: 4770085
num_row_groups: 13
serialized_size: 408988

The row groups each seem to be around 164 MB:
<pyarrow._parquet.RowGroupMetaData object at 0x7f7dc27bff40>
num_columns: 241
num_rows: 395374
total_byte_size: 164240468
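
This is roughly how I pulled those numbers (the path is just a placeholder for one of my files):

import pyarrow.parquet as pq

# Inspect file-level and row-group metadata without reading the data itself.
metadata = pq.ParquetFile('/download_folder/sample.parquet').metadata
print(metadata)               # num_columns, num_rows, num_row_groups, ...
print(metadata.row_group(0))  # per-row-group stats, including total_byte_size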

How do I fix this?

Thanks.

Hi @skunkwerk, which Ray version are you using? How many nodes are in your Ray cluster, and what do their specs look like?

Note that Parquet files are heavily compressed and can be several times larger once decompressed and loaded into Ray.
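
You can get a rough sense of that ratio from the file's own metadata, something along these lines (the path is a placeholder, and the in-memory Arrow size can still be larger than this uncompressed estimate):

import os
import pyarrow.parquet as pq

path = '/download_folder/sample.parquet'  # placeholder path
metadata = pq.ParquetFile(path).metadata

on_disk = os.path.getsize(path)
# Sum of the uncompressed column data across all row groups.
uncompressed = sum(metadata.row_group(i).total_byte_size
                   for i in range(metadata.num_row_groups))

print(f"on disk: {on_disk / 1e6:.0f} MB, "
      f"uncompressed: {uncompressed / 1e6:.0f} MB, "
      f"ratio: {uncompressed / on_disk:.1f}x")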

I’m using Ray 2.3.1 on a single local node with 64 GB of RAM and 4 CPUs.

Thanks.