OOM reading "small" parquet file

Hi @Andrea_Pisoni, it looks like the single file is over 56 GB in memory, so you're probably right: the data is massively inflating in memory because the dictionary-encoded string column(s) are being decoded into plain string arrays on read.
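
To see why that inflation happens, here is a minimal pyarrow sketch (the column contents are made up for illustration) comparing the in-memory size of a dictionary-encoded string column against its decoded form:

import pyarrow as pa

# A low-cardinality string column: a million rows, only ten distinct values.
plain = pa.array(["category_%d" % (i % 10) for i in range(1_000_000)])
encoded = plain.dictionary_encode()

# The dictionary array stores small integer indices plus one copy of each
# distinct string; decoding materializes the full string for every row.
print("dictionary-encoded: %.1f MB" % (encoded.nbytes / 1e6))
print("decoded strings:    %.1f MB" % (plain.nbytes / 1e6))

With longer strings and more rows the gap grows accordingly, which is consistent with a file that is small on disk ballooning past available memory.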

One thing that you could try is asking Ray Datasets to preserve that dictionary encoding in memory:

ds = ray.data.read_parquet(..., dataset_kwargs=dict(read_dictionary=["string_col1", "string_col2"]))

This gets passed through to pyarrow.parquet.ParquetDataset under the hood; the read_dictionary option is documented at pyarrow.parquet.ParquetDataset — Apache Arrow v9.0.0.
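
For completeness, here's an end-to-end sketch; the path is a placeholder and the column names are assumed to come from your schema. If the encoding was preserved, the schema should report those columns as Arrow dictionary types rather than plain strings:

import ray

ds = ray.data.read_parquet(
    "s3://my-bucket/big-file.parquet",  # placeholder path
    dataset_kwargs=dict(read_dictionary=["string_col1", "string_col2"]),
)

# Expect dictionary<values=string, indices=int32> for the listed columns
# instead of plain string, if the encoding survived the read.
print(ds.schema())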
