OOM reading "small" parquet file

Hi @Andrea_Pisoni, it looks like the single file is over 56 GB in memory, so you're probably right: the data is massively inflating in memory because the dictionary-encoded string column(s) are being decoded into plain string arrays on read.
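
To see why that inflation happens, here is a minimal pyarrow sketch (the column contents are made up for illustration) comparing the in-memory size of a dictionary-encoded string column against its decoded form:

import pyarrow as pa

# A low-cardinality string column: a million rows, only ten distinct values.
plain = pa.array(["category_%d" % (i % 10) for i in range(1_000_000)])
encoded = plain.dictionary_encode()

# The dictionary array stores small integer indices plus one copy of each
# distinct string; decoding materializes the full string for every row.
print("dictionary-encoded: %.1f MB" % (encoded.nbytes / 1e6))
print("decoded strings:    %.1f MB" % (plain.nbytes / 1e6))

With longer strings and more rows the gap grows accordingly, which is consistent with a file that is small on disk ballooning past available memory.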

One thing that you could try is asking Ray Datasets to preserve that dictionary encoding in memory:

ds = ray.data.read_parquet(..., dataset_kwargs=dict(read_dictionary=["string_col1", "string_col2"]))

This gets passed through to pyarrow.parquet.ParquetDataset under the hood; the read_dictionary option is documented at pyarrow.parquet.ParquetDataset — Apache Arrow v9.0.0.
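
For completeness, here's an end-to-end sketch; the path is a placeholder and the column names are assumed to come from your schema. If the encoding was preserved, the schema should report those columns as Arrow dictionary types rather than plain strings:

import ray

ds = ray.data.read_parquet(
    "s3://my-bucket/big-file.parquet",  # placeholder path
    dataset_kwargs=dict(read_dictionary=["string_col1", "string_col2"]),
)

# Expect dictionary<values=string, indices=int32> for the listed columns
# instead of plain string, if the encoding survived the read.
print(ds.schema())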
