I have 300 Parquet files with a total size of 5 GB. When I read the files using
ray.data.read_parquet(dir_path)
there is very high memory usage. I thought Ray was all about lazy loading and processing, but that does not seem to be the case when reading a dataset.
How can I create a Ray dataset without blowing up memory usage?
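For context, the whole notebook cell is roughly this minimal sketch (dir_path just points at the directory containing the Parquet files; this is the only operation I run on the dataset):

import ray

ray.init()  # local cluster started in the notebook

# The only Ray Data call in the notebook; nothing is consumed afterwards
ds = ray.data.read_parquet(dir_path)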
@Harsh-Maheshwari Are you performing any transformation or consumption operations after reading the dataset, such as ds.show(), ds.take(...), or ds.materialize()?
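Just to make the distinction clear, here is a rough sketch (not your exact code) of what stays lazy versus what triggers execution:

import ray

ds = ray.data.read_parquet(dir_path)  # lazy: only file metadata/schema is read here
ds.show(5)          # triggers execution of enough blocks to print 5 rows
ds.take(100)        # triggers execution and pulls up to 100 rows to the driver
ds.materialize()    # executes the whole dataset and keeps all blocks in the object store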
cc: @chengsu
No, I am just doing a simple ray.data.read_parquet() in a Jupyter notebook.
Here is the htop output before reading the data:
Here is the htop output after reading the data:
Also note that during the read the RAM usage went up to around 29 GB.
At the end I also got this output about a worker crashing due to an OOM issue:
A few things that may be important about my dataset: it has 28,000 columns and 65,000 rows, all numeric. Right now I am reading the dataset from almost 200 Parquet files.
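Given how wide the schema is (28,000 columns), would projecting columns at read time help? A rough sketch of what I have in mind; the column names below are placeholders, and I am assuming the columns argument of read_parquet pushes the projection down to the Parquet reader:

import ray

# Only read the columns that are actually needed (placeholder names),
# so each block holds far fewer than 28,000 columns
needed_cols = ["feature_1", "feature_2", "target"]
ds = ray.data.read_parquet(dir_path, columns=needed_cols)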