I am working on a distributed PyTorch training pipeline with Ray Datasets, and I have the following question:
After running `ds = ray.data.read_parquet("path", parallelism=20)` and then applying `to_torch()`, should I think of the resulting PyTorch dataset as a distributed dataset, or is it the same as a usual one like `torchvision.datasets.CIFAR10("dir")`?
Hi @sgwhat, thanks for posting! I just noticed that this post never got a response.
Yes. If you run `ds = ray.data.read_parquet()` on a multi-node cluster, the read tasks (and therefore your data) will be spread across the cluster. All of the data backing the returned Torch dataset is therefore distributed, and blocks are pulled to the consumer (the trainer) on demand as you iterate over the dataset during training.
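For concreteness, here is a minimal sketch of that pattern. The parquet path and the `label` column name are hypothetical placeholders for your own data; also note that in more recent Ray releases this functionality has moved to `iter_torch_batches()`.

```python
import ray
import ray.data

ray.init(address="auto")  # connect to the running multi-node cluster

# Read tasks are distributed across the cluster; the resulting blocks
# live on whichever nodes executed the reads.
ds = ray.data.read_parquet("s3://my-bucket/data", parallelism=20)  # hypothetical path

# to_torch() returns a torch IterableDataset that yields (features, label) batches.
torch_ds = ds.to_torch(label_column="label", batch_size=256)  # hypothetical column name

# Iterating pulls blocks to this process (the trainer) as needed.
for features, labels in torch_ds:
    pass  # your forward/backward pass would go here
```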