I’m trying to convert our internal IterableDataset to a Ray Dataset.
The dataset is huge: it reads Parquet files, downloads images from S3, assigns each sample to a bucket based on its aspect ratio, and yields batches drawn from a single bucket at a time.
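For context, the dataset looks roughly like the sketch below. This is a simplification, not our real code: `BucketedImages`, the `(image, width, height)` sample tuples, and the coarse `round(width / height, 1)` bucket key all stand in for the actual Parquet/S3 logic.

```python
from collections import defaultdict
from torch.utils.data import IterableDataset

class BucketedImages(IterableDataset):
    """Simplified stand-in for our dataset: groups samples by aspect
    ratio and yields full batches from a single bucket at a time."""

    def __init__(self, samples, batch_size=4):
        self.samples = samples          # (image, width, height) records
        self.batch_size = batch_size

    def __iter__(self):
        buckets = defaultdict(list)
        for image, width, height in self.samples:
            key = round(width / height, 1)      # coarse aspect-ratio bucket
            bucket = buckets[key]
            bucket.append(image)
            if len(bucket) == self.batch_size:  # batch is full: emit it
                yield list(bucket)
                bucket.clear()
```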
Since bucketing is not supported in Ray Data, we decided to convert the dataset with ray.data.from_torch().
But I’ve noticed that, before yielding any data with iter_rows() for example, the Ray dataset runs over the whole underlying dataset in advance, which in our case makes it unusable. It seems like odd, unintended behaviour, though. Is it really so? How can we work around this problem?
Here is a minimal reproducible example (for ease of debugging, I expanded the from_torch implementation inline):
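Something like the following shows the issue. Again a simplified sketch rather than our real pipeline: `LoggingIterable` is my placeholder that just prints when each item is produced, and the `from_items(list(...))` line is what `from_torch(ds)` appears to boil down to in the Ray release we’re on.

```python
import ray
from torch.utils.data import IterableDataset

class LoggingIterable(IterableDataset):
    """Prints when each item is produced, so it is visible *when*
    Ray actually consumes the source."""

    def __iter__(self):
        for i in range(5):
            print(f"producing item {i}")
            yield {"idx": i}

# Expanded from_torch: in our Ray version, from_torch(ds) reduces to
# from_items(list(ds)), so the whole source is consumed right here...
ds = ray.data.from_items(list(LoggingIterable()))

# ...and every "producing item" line prints before the first row is yielded.
for row in ds.iter_rows():
    print("got row:", row)
```

With the implementation expanded like this, the eager pass is easy to see: all five "producing item" lines print before the loop over iter_rows() yields its first row.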