I have a 5 GB CSV file that I would like to load and transform in shards.
This seems possible based on this article: https://www.anyscale.com/blog/deep-dive-data-ingest-in-a-third-generation-ml-architecture. However, the description of window in the article (where it's mentioned as 200 GB) doesn't really match the description of the window argument in the documentation.
But more importantly, no matter what I set that window parameter to (10, for example), my machine still runs out of memory because it appears to be loading the entire file.
I was hoping that the following code would send chunks of the source file to my remote tasks, so I would expect the read statement to return almost immediately, but instead it sits there indefinitely while memory climbs:
import ray

ds = ray.data.read_csv('C:\\temp\\50m.csv').window(10).repeat()
ds1 = ds.map_batches(lambda b: func(b), batch_format='pandas')
Am I missing something? Is it true that Ray is able to take a large input CSV file and distribute pieces to remotes for transformation, without loading it all into a dataset first?
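To clarify what I mean by "chunks": the behavior I was hoping window would give me is what plain pandas does with chunksize, where the file is streamed in fixed-size pieces and only one piece is in memory at a time. A minimal sketch of that expectation (using a small in-memory CSV as a stand-in for my real file, and a placeholder transform named func):

```python
import io
import pandas as pd

# Small in-memory CSV standing in for the real multi-GB file.
csv_text = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(10))

def func(batch: pd.DataFrame) -> pd.DataFrame:
    # Placeholder transform; in my Ray code this would run remotely.
    batch["c"] = batch["a"] + batch["b"]
    return batch

# pandas streams the file in fixed-size chunks instead of loading it whole,
# so peak memory is bounded by the chunk size, not the file size.
chunks = [func(chunk) for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4)]
result = pd.concat(chunks, ignore_index=True)
print(len(result))  # 10 rows, processed across 3 chunks of at most 4 rows
```

This is the memory profile I expected from the windowed Ray pipeline, just distributed across workers instead of running in one process.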