Can Ray Data be used on Datasets that don't fit in memory?


I’ve been trying out Ray Data over the past few days, and I’ve found that it OOMs on datasets loaded with read_parquet whenever the uncompressed size plus overhead exceeds the memory I have available, even for basic operations like summing a column.

It seems like it should be possible to limit dataset operations to the memory available, but the documentation (see aggregate, for example) says that many operations require the dataset to be materialized. Is this just a use case that Ray Data isn’t intended to support?

Hey @smcintosh, could you tell me more about what you’re trying to use Ray Data for? Ray Data is designed for ML training and inference. It isn’t meant as a replacement for generic ETL pipelines.

Operations like summations materialize all of the data in memory, and there isn’t a way to only use the memory available.
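To illustrate the distinction in plain Python (this is a conceptual sketch, not Ray Data’s actual implementation): a streaming aggregation only ever holds one chunk in memory, while a materializing aggregation needs the whole dataset resident at once, which is where the OOM comes from.

```python
from typing import Iterable, Iterator


def chunked_rows(n_rows: int, chunk_size: int) -> Iterator[list[int]]:
    """Yield the dataset in fixed-size chunks, as a streaming reader would."""
    for start in range(0, n_rows, chunk_size):
        yield list(range(start, min(start + chunk_size, n_rows)))


def streaming_sum(chunks: Iterable[list[int]]) -> int:
    """Constant-memory aggregation: only one chunk is resident at a time."""
    total = 0
    for chunk in chunks:
        total += sum(chunk)
    return total


def materializing_sum(chunks: Iterable[list[int]]) -> int:
    """Materializing aggregation: collects every chunk before summing.

    Peak memory scales with the full dataset size, so this fails once the
    dataset no longer fits in memory.
    """
    all_rows = [row for chunk in chunks for row in chunk]
    return sum(all_rows)


# Both give the same answer; only their peak memory differs.
print(streaming_sum(chunked_rows(1_000_000, 10_000)))
print(materializing_sum(chunked_rows(1_000_000, 10_000)))
```

Both functions compute the same result; the difference is that the streaming version’s peak memory is bounded by the chunk size, which is the behavior the original question was hoping for from sum.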