Hello,
I’ve been trying out Ray Data for the past few days and have found that it runs out of memory (OOMs) on datasets I’ve loaded with read_parquet when the uncompressed size plus overhead is larger than the memory I have available, even for basic operations like summing a column.
It should be possible to restrict dataset operations to the memory that is available, but the documentation (see aggregate, for example) says that many operations require the dataset to be materialized. Is this simply a use case that Ray Data isn’t intended to support?
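
For concreteness, the kind of pipeline I mean looks roughly like the sketch below. The function name, path, and column name are placeholders for illustration only, not my actual code or data; the only Ray calls it uses are ray.data.read_parquet and Dataset.sum.

```python
def sum_column(path: str, column: str):
    """Read a Parquet dataset with Ray Data and sum a single column.

    `path` and `column` are hypothetical placeholders.
    """
    # Imported inside the function so the sketch can be defined
    # even on a machine where Ray is not installed.
    import ray

    # Builds the dataset from the Parquet files at `path`.
    ds = ray.data.read_parquet(path)

    # Aggregation over one column; this is the step where I see the OOM
    # once the uncompressed data no longer fits in memory.
    return ds.sum(column)
```

It would be called as something like sum_column("/data/events/", "amount"), where both arguments are again made up.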