OOM reading "small" parquet file

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

I have a parquet file on HDFS that I am trying to read with Ray Dataset. I am running out of memory on a 64 GB node. It's strange because the file on HDFS is just 3 GB, but since the columns are saved as strings, I suspect it is simply exploding in memory. However, the error message I receive is strange:

	2022-08-19 08:06:35,929	ERROR worker.py:94 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::_StatsActor.record_task() (pid=758, ip=, repr=<ray.data.impl.stats._StatsActor object at 0x7ff8fa191af0>)
ray._private.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node k4aczpfr75r7ctfy is used (59.29 / 59.51 GB). The top 10 memory consumers are:
652	57.85GiB	ray::IDLE
206	0.41GiB	/home/cdsw/.conda/envs/ray_env/bin/python -m ipykernel_launcher -f /home/cdsw/.local/share/jupyter/r
461	0.11GiB	/home/cdsw/.conda/envs/ray_env/bin/python -u /home/cdsw/.local/lib/python3.9/site-packages/ray/dashb
561	0.1GiB	/home/cdsw/.conda/envs/ray_env/bin/python -u /home/cdsw/.local/lib/python3.9/site-packages/ray/dashb
758	0.06GiB	ray::_StatsActor
941	0.06GiB	ray::IDLE
164	0.06GiB	/usr/local/bin/python3.6 /usr/local/bin/jupyter-lab --no-browser --ip= --port=8090 --Notebo
446	0.05GiB	/home/cdsw/.conda/envs/ray_env/bin/python -u /home/cdsw/.local/lib/python3.9/site-packages/ray/autos
98	0.05GiB	/usr/local/bin/python3.6 /var/lib/cdsw/python3-engine-deps/bin/ipython3 kernel --automagic --no-secu
453	0.05GiB	/home/cdsw/.conda/envs/ray_env/bin/python -m ray.util.client.server --address= --
In addition, up to 0.1 GiB of shared memory is currently being used by the Ray object store.

This shows that the memory is occupied by ray::IDLE? Also:

[2022-08-19 08:02:20,228 E 652 652] plasma_store_provider.cc:132: Failed to put object 32d950ec0ccf9d2affffffffffffffffffffffff0100000001000000 in object store because it is full. Object size is 562617392276 bytes.
Plasma store status:
(global lru) capacity: 18795524505
(global lru) used: 0%
(global lru) num objects: 0
(global lru) num evictions: 0
(global lru) bytes evicted: 0

Object store is full - but at the same time empty?

I am thinking maybe this is just a strange error message, and the underlying issue is simply the parquet file exploding in memory due to string types. Is there any way I can sample just a few records of the parquet file to see how much space they would take? I used limit and take(10), but it seems the whole file is still loaded into memory (which in turn causes the crash).

Hi @Andrea_Pisoni, it looks like the single file is over 56 GB in memory, so you're probably right: the data is massively inflating in memory because the dictionary-encoded string column(s) get expanded into plain strings when read.
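
To give a feel for the size of that inflation, here is a rough back-of-the-envelope sketch. It assumes an Arrow string column costs its UTF-8 bytes plus 32-bit offsets, and that a dictionary-encoded column costs one small string dictionary plus 32-bit indices; validity bitmaps are ignored. The row count, number of distinct values, and average length are made-up illustration numbers, not measurements from your file.

```python
def plain_string_bytes(num_rows, avg_len):
    # Plain Arrow string column: UTF-8 value bytes plus (num_rows + 1) 32-bit offsets.
    return num_rows * avg_len + 4 * (num_rows + 1)


def dictionary_string_bytes(num_rows, num_distinct, avg_len, index_width=4):
    # Dictionary-encoded column: the distinct strings stored once,
    # plus one fixed-width index per row (32-bit indices assumed).
    dictionary = plain_string_bytes(num_distinct, avg_len)
    indices = num_rows * index_width
    return dictionary + indices


# Hypothetical column: 100M rows, 50 distinct values, 200 bytes each on average.
rows, distinct, avg_len = 100_000_000, 50, 200
plain = plain_string_bytes(rows, avg_len)
encoded = dictionary_string_bytes(rows, distinct, avg_len)
print(f"plain:      {plain / 1e9:.2f} GB")
print(f"dictionary: {encoded / 1e9:.2f} GB")
print(f"inflation:  {plain / encoded:.0f}x")
```

With a few such low-cardinality columns, a file that is a few GB on disk can plausibly decode to tens of GB.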

One thing that you could try is asking Ray Datasets to preserve that dictionary encoding in memory:

ds = ray.data.read_parquet(..., dataset_kwargs=dict(read_dictionary=["string_col1", "string_col2"]))

This will get passed to pyarrow.parquet.ParquetDataset under the hood; see the pyarrow.parquet.ParquetDataset documentation (Apache Arrow v9.0.0).


Indeed, that did the trick of course. Learned in the process that a lot of the arrow arguments can be used in Ray read parquet - neat!
