I have a 5 GB CSV file that I would like to load and transform in shards.
This seems possible based on this article: https://www.anyscale.com/blog/deep-dive-data-ingest-in-a-third-generation-ml-architecture. However, the description of window in the article (where it's mentioned as 200 GB) doesn't really match the description of the window argument in the documentation.
But more importantly, no matter what I set that window parameter to (10, for example), my machine still runs out of memory because it appears to be loading the entire file.
I was hoping that the following code would send chunks of the source file to my remote tasks, so I would expect the read statement to return almost immediately, but instead it sits there indefinitely while memory climbs:
import ray

ds = ray.data.read_csv('C:\\temp\\50m.csv').window(10).repeat()
ds1 = ds.map_batches(lambda b: func(b), batch_format='pandas')
Am I missing something? Is it true that Ray is able to take a large input CSV file and distribute pieces to remotes for transformation, without loading it all into a dataset first?
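To clarify what I mean by "chunks": the behavior I was hoping window would give me is what plain pandas does with chunksize, where the file is streamed in fixed-size pieces and only one piece is in memory at a time. A minimal sketch of that expectation (using a small in-memory CSV as a stand-in for my real file, and a placeholder transform named func):

```python
import io
import pandas as pd

# Small in-memory CSV standing in for the real multi-GB file.
csv_text = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(10))

def func(batch: pd.DataFrame) -> pd.DataFrame:
    # Placeholder transform; in my Ray code this would run remotely.
    batch["c"] = batch["a"] + batch["b"]
    return batch

# pandas streams the file in fixed-size chunks instead of loading it whole,
# so peak memory is bounded by the chunk size, not the file size.
chunks = [func(chunk) for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4)]
result = pd.concat(chunks, ignore_index=True)
print(len(result))  # 10 rows, processed across 3 chunks of at most 4 rows
```

This is the memory profile I expected from the windowed Ray pipeline, just distributed across workers instead of running in one process.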