Can Ray Dataset be used between S3 and PyTorch?

Hi,

I have a PyTorch training job on EC2 (plain PyTorch, not Ray Train) that trains on thousands of large images (~100 MB each). I’m hesitating between:

  1. creating a map-style PyTorch Dataset class that fetches images from S3 in __getitem__ (possibly with a local cache so that epoch 2 onwards reads locally) — a rough sketch of what I mean is below the list,
  2. reading multi-image files sequentially via an iterable dataset, powered by WebDataset. This seems to be state of the art, but the documentation is a bit sparse.
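
For illustration, option 1 would look roughly like this (a minimal sketch with hypothetical names; it assumes boto3 and Pillow, and that I already have lists of S3 keys and labels):

import os

import boto3
from PIL import Image
from torch.utils.data import Dataset


class S3ImageDataset(Dataset):
    """Map-style dataset that fetches images from S3 and caches them locally."""

    def __init__(self, bucket, keys, labels, cache_dir="/tmp/s3_cache", transform=None):
        self.bucket = bucket
        self.keys = keys          # list of S3 object keys, one per image
        self.labels = labels      # list of labels aligned with keys
        self.cache_dir = cache_dir
        self.transform = transform
        self._s3 = None           # created lazily so the dataset pickles cleanly
        os.makedirs(cache_dir, exist_ok=True)

    def _client(self):
        if self._s3 is None:
            self._s3 = boto3.client("s3")
        return self._s3

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, idx):
        key = self.keys[idx]
        local_path = os.path.join(self.cache_dir, key.replace("/", "_"))
        if not os.path.exists(local_path):
            # Epoch 1: download from S3 and keep a local copy.
            self._client().download_file(self.bucket, key, local_path)
        # Epoch 2 onwards: read straight from the local cache.
        image = Image.open(local_path).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, self.labels[idx]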

I’m satisfied with neither: option 1 is transparent and easy, but verbose. Option 2 is performant but more opaque (and requires an additional step of packing the images into tar files).

I’m wondering: can Ray Datasets save the day here and power efficient data loading of large images from S3 into a PyTorch script? How?

Hi @Lacruche, Datasets should work great here! Datasets provides an API for reading binary files such as imagery, and has a .to_torch() API that yields a familiar PyTorch IterableDataset that you can directly consume within your PyTorch trainer:

import ray
import torch

# to_image / to_label are placeholders for your own decoding logic, and
# num_epochs, net, criterion, and optimizer come from your existing training setup.
ds_pipe = ray.data.read_binary_files("s3://some/bucket") \
    .map(lambda raw: {"data": to_image(raw), "label": to_label(raw)}) \
    .repeat(num_epochs)

for epoch, ds in enumerate(ds_pipe.iter_epochs()):
    torch_ds: torch.utils.data.IterableDataset = ds.to_torch(...)
    for batch_idx, data in enumerate(torch_ds):
        # get the inputs; data is an (inputs, labels) tuple
        inputs, labels = data

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
The imagery will be read into and held in distributed memory, and will be fed to your trainer(s) with options for prefetching, batching, etc.
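
For example, the to_torch(...) call above could be filled in along these lines (a sketch only; the column names assume the map above produced "data" and "label" columns, and you should check the to_torch signature for your Ray version):

# Illustrative only; argument values here are assumptions, not requirements.
torch_ds = ds.to_torch(
    label_column="label",       # yields (features, label) tuples
    feature_columns=["data"],
    batch_size=32,              # batch size of the yielded tensors
    prefetch_blocks=2,          # prefetch upcoming blocks while training
)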


Btw @Lacruche, for high-performance image loading, have you checked out https://github.com/libffcv/ffcv?

They tout better performance than WebDataset.


Interesting, I wasn’t aware of it! Thanks.

Hey @Lacruche. Another alternative you could consider is Activeloop Hub. It’s an OSS format optimized for computer vision data, and its primary benefit is the ability to stream data while training models. The Python API is also very simple, and you can continue to store your data in your S3 bucket.
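
For a sense of the API, here is a rough sketch (hypothetical dataset path; it assumes the images have already been ingested into a Hub dataset in your bucket and that AWS credentials are available in the environment — exact arguments may differ by version):

import hub

# Hypothetical path: an existing Hub dataset stored in your S3 bucket.
ds = hub.load("s3://some/bucket/my-hub-dataset")

# Stream samples straight from S3 into a PyTorch dataloader while training.
dataloader = ds.pytorch(batch_size=32, num_workers=4, shuffle=True)

for batch in dataloader:
    # Batches are keyed by tensor name, e.g. batch["images"], batch["labels"].
    ...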

Full disclosure: I work for Activeloop, but we’d love it if you gave it a shot.