Converting to Ray's Dataset

We have data coming in as a PyTorch IterableDataset. Is there a way to convert it to a Ray Dataset to reuse Ray’s parallelism?
I see from the below link that a PyTorch IterableDataset can be derived from a Ray Dataset using ds.to_torch(), but not the other way round.


Hi @pratap123, interesting, we’ve typically had users going from a Ray Dataset → Torch IterableDataset, not the other way around!

The closest that we currently have is the following, which would require materializing the IterableDataset all at once: `ds = …`

May I ask how the data in IterableDataset is generated and how you’d like this conversion to take place? Would you want this conversion to be done in a streaming fashion, where the IterableDataset is gradually consumed?

@Clark_Zinzow Thanks for the response… The main aim is to read, in parallel, the data produced by the Torch IterableDataset. Hence I am looking to convert it to a Ray Dataset to utilize Ray’s parallelism.

@pratap123 Understood! The issue here is that the Torch IterableDataset interface is an iterator, i.e. a streaming interface, so it won’t be possible for Datasets to parallelize the conversion of the IterableDataset itself. If you constructed the Dataset with, say, parallelism=10, all future Dataset operations on the data would be parallelized over 10 blocks of data, but this would still require consuming the entire IterableDataset to construct the Ray Dataset.

Could you tell me more about how the data within IterableDataset is generated?

Thanks @Clark_Zinzow for the quick response.
Unfortunately I don’t have many details about the IterableDataset, since it comes to us from a different component; we just read that data.
All I am trying to do is parallelize the data read. We have used Ray for processing the data once it’s read, but it didn’t help much because the data read is serial and is taking time, so I am trying to parallelize that too.