I have 300 Parquet files with a total size of 5 GB. When I read the files using
ray.data.read_parquet(dir_path)
there is very high memory usage. I thought Ray was all about lazy loading and processing, but that does not seem to be the case when reading a dataset.
How can I create a Ray dataset without blowing up memory usage?
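For context, the whole notebook cell is roughly this minimal sketch (dir_path just points at the directory containing the Parquet files; this is the only operation I run on the dataset):

import ray

ray.init()  # local cluster started in the notebook

# The only Ray Data call in the notebook; nothing is consumed afterwards
ds = ray.data.read_parquet(dir_path)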
@Harsh-Maheshwari Are you performing any transformation or consumption operations after reading the dataset, such as ds.show(), ds.take(...), or ds.materialize()?
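Just to make the distinction clear, here is a rough sketch (not your exact code) of what stays lazy versus what triggers execution:

import ray

ds = ray.data.read_parquet(dir_path)  # lazy: only file metadata/schema is read here
ds.show(5)          # triggers execution of enough blocks to print 5 rows
ds.take(100)        # triggers execution and pulls up to 100 rows to the driver
ds.materialize()    # executes the whole dataset and keeps all blocks in the object store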
cc: @chengsu
No, I am just doing a simple ray.data.read_parquet() in a Jupyter notebook.
Here is the htop output before reading the data:
Here is the htop output after reading the data:
Also note that during the read the RAM usage went up to around 29 GB.
At the end I also got this output about a worker crashing due to an OOM issue:
A few things that may be important about my dataset: it has 28,000 columns and 65,000 rows, all numeric. Right now I am reading the dataset from almost 200 Parquet files.
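Given how wide the schema is (28,000 columns), would projecting columns at read time help? A rough sketch of what I have in mind; the column names below are placeholders, and I am assuming the columns argument of read_parquet pushes the projection down to the Parquet reader:

import ray

# Only read the columns that are actually needed (placeholder names),
# so each block holds far fewer than 28,000 columns
needed_cols = ["feature_1", "feature_2", "target"]
ds = ray.data.read_parquet(dir_path, columns=needed_cols)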