Hi @Andrea_Pisoni, it looks like that single file takes over 56 GB in memory, so you're probably right: the data is massively inflating in memory because the dictionary-encoded string column(s) get fully decoded when read.
One thing that you could try is asking Ray Datasets to preserve that dictionary encoding in memory:
```python
ds = ray.data.read_parquet(..., dataset_kwargs=dict(read_dictionary=["string_col1", "string_col2"]))
```
This will get passed to `pyarrow.parquet.ParquetDataset` under the hood (see pyarrow.parquet.ParquetDataset — Apache Arrow v9.0.0).
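For reference, here's a minimal end-to-end sketch. The path and the column names (`string_col1`, `string_col2`) are placeholders for your actual file and high-cardinality string columns:

```python
import ray

ray.init()

# Ask the underlying PyArrow reader to keep these columns
# dictionary-encoded instead of materializing full strings in memory.
ds = ray.data.read_parquet(
    "s3://my-bucket/my-file.parquet",  # placeholder path
    dataset_kwargs=dict(read_dictionary=["string_col1", "string_col2"]),
)

# Check how large the dataset actually is once loaded.
print(ds.size_bytes())
```

Comparing `ds.size_bytes()` with and without `read_dictionary` should tell you how much of the blowup is coming from those string columns.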