Ray Dataset from IterableDataset: no lazy implementation?

I’m trying to convert our internal IterableDataset to a Ray Dataset.
The dataset is huge: it reads Parquet files, downloads images from S3, assigns each image to a bucket based on its aspect ratio, and yields batches drawn from a single bucket (roughly like the sketch below).
Since this kind of bucketing is not supported in Ray Data, we decided to convert the dataset with ray.data.from_torch().
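
For context, the dataset is shaped roughly like this (a simplified sketch, not our real code: `download_from_s3` and the bucket key are placeholders):

```python
from collections import defaultdict

from torch.utils.data import IterableDataset


class AspectRatioBucketDataset(IterableDataset):
    """Reads Parquet rows, fetches images from S3, and yields fixed-size
    batches whose images all fall into the same aspect-ratio bucket."""

    def __init__(self, rows, batch_size):
        self.rows = rows  # Parquet records: image URL, width, height, ...
        self.batch_size = batch_size

    def __iter__(self):
        buckets = defaultdict(list)
        for row in self.rows:
            image = download_from_s3(row["url"])  # placeholder helper
            key = round(row["width"] / row["height"], 1)  # illustrative bucket rule
            buckets[key].append(image)
            if len(buckets[key]) == self.batch_size:
                yield buckets.pop(key)  # one batch, all from the same bucket
```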

But I’ve noticed that before yielding any data (via iter_rows(), for example), the resulting Ray Dataset makes a full pass over the whole source dataset up front, which in our case makes it unusable. This seems like odd, unintended behaviour. Is it really so? How can we circumvent this problem?

Here is a minimal reproducible example (for ease of debugging, I expanded the from_torch implementation):
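
(Simplified sketch of the repro: SlowDataset and the sleep stand in for our real Parquet/S3 pipeline, and the inlined from_items(list(dataset)) is what from_torch() appears to reduce to in the Ray version I’m on.)

```python
import time

import ray
from torch.utils.data import IterableDataset


class SlowDataset(IterableDataset):
    """Stand-in for the real pipeline: each item is expensive to produce."""

    def __init__(self, n: int):
        self.n = n

    def __iter__(self):
        for i in range(self.n):
            time.sleep(0.1)  # simulates the S3 download + decode
            print(f"produced item {i}")
            yield {"idx": i}


torch_ds = SlowDataset(100)

# Expanded from_torch(): it reduces to from_items(list(dataset)),
# and list() exhausts the whole IterableDataset right here.
ds = ray.data.from_items(list(torch_ds))

for row in ds.iter_rows():
    print(row)  # nothing arrives until the full pass above has finished
    break
```

All 100 "produced item" lines print before iter_rows() yields a single row, which shows the full pass happens eagerly at dataset creation rather than lazily during iteration.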