Can Ray Data be used on Datasets that don't fit in memory?


I’ve been trying out Ray Data over the past few days, and I’ve found that it OOMs on datasets loaded with read_parquet whenever the uncompressed size plus overhead exceeds the memory I have available, even for basic operations like summing a column.

It seems like it should be possible to limit dataset operations to the memory available, but the documentation (see aggregate, for example) says that many operations require the dataset to be materialized. Is this just a use case that Ray Data isn’t intended to support?

Hey @smcintosh, could you tell me more about what you’re trying to use Ray Data for? Ray Data is designed for ML training and inference. It isn’t meant as a replacement for generic ETL pipelines.

Operations like summations materialize all of the data in memory, and there isn’t a way to only use the memory available.
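To illustrate the distinction in plain Python (this is a conceptual sketch, not Ray Data’s actual implementation): a streaming aggregation only ever holds one chunk in memory, while a materializing aggregation needs the whole dataset resident at once, which is where the OOM comes from.

```python
from typing import Iterable, Iterator


def chunked_rows(n_rows: int, chunk_size: int) -> Iterator[list[int]]:
    """Yield the dataset in fixed-size chunks, as a streaming reader would."""
    for start in range(0, n_rows, chunk_size):
        yield list(range(start, min(start + chunk_size, n_rows)))


def streaming_sum(chunks: Iterable[list[int]]) -> int:
    """Constant-memory aggregation: only one chunk is resident at a time."""
    total = 0
    for chunk in chunks:
        total += sum(chunk)
    return total


def materializing_sum(chunks: Iterable[list[int]]) -> int:
    """Materializing aggregation: collects every chunk before summing.

    Peak memory scales with the full dataset size, so this fails once the
    dataset no longer fits in memory.
    """
    all_rows = [row for chunk in chunks for row in chunk]
    return sum(all_rows)


# Both give the same answer; only their peak memory differs.
print(streaming_sum(chunked_rows(1_000_000, 10_000)))
print(materializing_sum(chunked_rows(1_000_000, 10_000)))
```

Both functions compute the same result; the difference is that the streaming version’s peak memory is bounded by the chunk size, which is the behavior the original question was hoping for from sum.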