Can Ray Dataset be used between S3 and PyTorch?

Hi,

I have a PyTorch training job on EC2 (plain PyTorch, not Ray Train) that trains on thousands of large images (~100 MB each). I’m hesitating between:

  1. creating a map-style PyTorch Dataset class that fetches images from S3 in __getitem__ (possibly with a local cache so that epoch 2 onwards reads locally) — a rough sketch of what I mean is below the list,
  2. reading multi-image files sequentially via an iterable dataset, powered by WebDataset. This seems to be state of the art, but the documentation is a bit sparse.
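
For illustration, option 1 would look roughly like this (a minimal sketch with hypothetical names; it assumes boto3 and Pillow, and that I already have lists of S3 keys and labels):

import os

import boto3
from PIL import Image
from torch.utils.data import Dataset


class S3ImageDataset(Dataset):
    """Map-style dataset that fetches images from S3 and caches them locally."""

    def __init__(self, bucket, keys, labels, cache_dir="/tmp/s3_cache", transform=None):
        self.bucket = bucket
        self.keys = keys          # list of S3 object keys, one per image
        self.labels = labels      # list of labels aligned with keys
        self.cache_dir = cache_dir
        self.transform = transform
        self._s3 = None           # created lazily so the dataset pickles cleanly
        os.makedirs(cache_dir, exist_ok=True)

    def _client(self):
        if self._s3 is None:
            self._s3 = boto3.client("s3")
        return self._s3

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, idx):
        key = self.keys[idx]
        local_path = os.path.join(self.cache_dir, key.replace("/", "_"))
        if not os.path.exists(local_path):
            # Epoch 1: download from S3 and keep a local copy.
            self._client().download_file(self.bucket, key, local_path)
        # Epoch 2 onwards: read straight from the local cache.
        image = Image.open(local_path).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, self.labels[idx]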

I’m satisfied with neither: option 1 is transparent and easy, but verbose. Option 2 is performant but more opaque (and requires an additional step of packing the images into tar files).

I’m wondering: can Ray Datasets save the day here and power efficient data loading of large images from S3 into a PyTorch script? How?

Hi @Lacruche, Datasets should work great here! Datasets provides an API for reading binary files such as imagery, and has a .to_torch() API that yields a familiar PyTorch IterableDataset that you can directly consume within your PyTorch trainer:

import ray
import torch

# to_image / to_label are placeholders for your own decoding logic, and
# num_epochs, net, criterion, and optimizer come from your existing training setup.
ds_pipe = ray.data.read_binary_files("s3://some/bucket") \
    .map(lambda raw: {"data": to_image(raw), "label": to_label(raw)}) \
    .repeat(num_epochs)

for epoch, ds in enumerate(ds_pipe.iter_epochs()):
    torch_ds: torch.utils.data.IterableDataset = ds.to_torch(...)
    for batch_idx, data in enumerate(torch_ds):
        # get the inputs; data is an (inputs, labels) tuple
        inputs, labels = data

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
The imagery will be read into and held in distributed memory, and will be fed to your trainer(s) with options for prefetching, batching, etc.
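
For example, the to_torch(...) call above could be filled in along these lines (a sketch only; the column names assume the map above produced "data" and "label" columns, and you should check the to_torch signature for your Ray version):

# Illustrative only; argument values here are assumptions, not requirements.
torch_ds = ds.to_torch(
    label_column="label",       # yields (features, label) tuples
    feature_columns=["data"],
    batch_size=32,              # batch size of the yielded tensors
    prefetch_blocks=2,          # prefetch upcoming blocks while training
)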


Btw @Lacruche, for high-performance image loading, have you checked out https://github.com/libffcv/ffcv?

They tout better performance than WebDataset.


Interesting, I wasn’t aware of it! Thanks.

Hey @Lacruche. Another alternative you could consider is Activeloop Hub. It’s an OSS format optimized for computer vision data, and its primary benefit is the ability to stream data while training models. The Python API is also very simple, and you can continue to store your data in your S3 bucket.
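
For a sense of the API, here is a rough sketch (hypothetical dataset path; it assumes the images have already been ingested into a Hub dataset in your bucket and that AWS credentials are available in the environment — exact arguments may differ by version):

import hub

# Hypothetical path: an existing Hub dataset stored in your S3 bucket.
ds = hub.load("s3://some/bucket/my-hub-dataset")

# Stream samples straight from S3 into a PyTorch dataloader while training.
dataloader = ds.pytorch(batch_size=32, num_workers=4, shuffle=True)

for batch in dataloader:
    # Batches are keyed by tensor name, e.g. batch["images"], batch["labels"].
    ...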

Full disclosure: I work for Activeloop, but we’d love it if you gave it a shot.