Ray Data read Parquet loads all the data in one go

I have 300 Parquet files with a total size of 5 GB. When I read the files using


there is very high memory usage. I thought Ray was all about lazy loading and processing, but that does not seem to be the case when reading a dataset.

How can I create a Ray dataset without blowing up memory usage?

@Harsh-Maheshwari Are you performing any operations after reading the dataset, such as ds.show(), ds.take(..), or ds.materialize()?

cc: @chengsu
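For context on why those calls matter, here is a minimal sketch of the difference between building a dataset and consuming it. The function name preview_rows and its arguments are hypothetical, and exact behavior may vary between Ray versions, so treat this as an illustration rather than a description of what happened in this thread.

```python
def preview_rows(path, columns=None, limit=5):
    """Build a Ray dataset from Parquet files and pull only a few rows.

    `path`, `columns`, and `limit` are illustrative parameters.
    """
    # Imported here so the sketch can be loaded even where Ray is not installed.
    import ray

    # Builds the dataset. `columns` restricts which columns are read,
    # which can matter a lot for very wide tables.
    ds = ray.data.read_parquet(path, columns=columns)

    # take() triggers execution and returns up to `limit` rows as a list
    # of dicts. By contrast, ds.materialize() would run the whole read
    # and keep every block in the object store.
    return ds.take(limit)
```

Calling preview_rows on a directory of Parquet files with a short column list is generally a cheap way to check whether memory use comes from the read itself or from a later consuming call.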

No, I am just doing a simple ray.data.read_parquet() in a Jupyter notebook.

Here is the htop output before reading the data:

Here is the htop output after reading the data:

Also note that during the reading process, RAM usage went up to around 29 GB.

At the end I also got this output about a worker crashing due to an OOM issue:

A few things that may be important about my dataset: it has 28,000 columns and 65,000 rows, and all columns are numeric. Right now I am reading the dataset from almost 200 Parquet files.
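As a rough sanity check on those numbers, the sketch below estimates the uncompressed size of the table. It assumes every value is stored densely as 8 bytes (float64 or int64), which the thread does not confirm. The helper name dense_size_gib is made up for this example.

```python
def dense_size_gib(rows, cols, bytes_per_value=8):
    """Size in GiB of a dense numeric table with no compression.

    bytes_per_value=8 is an assumption (float64/int64); use 4 for float32.
    """
    return rows * cols * bytes_per_value / 2**30


# Figures quoted in the post above.
size = dense_size_gib(65_000, 28_000)
print(f"{size:.1f} GiB")  # 13.6 GiB
```

Under that assumption, the decoded data is several times larger than the 5 GB of compressed Parquet on disk. That may partly explain why memory use is much higher than the file size suggests, although the estimate alone does not account for the 29 GB peak.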