I have a PyTorch training job on EC2 (plain PyTorch, not Ray Train) that trains on thousands of large images (~100 MB each). I’m hesitating between two options:
1. creating a map-style PyTorch Dataset class that fetches images from S3 in __getitem__ (possibly with a local cache so that epoch 2 reads locally); a rough sketch of this option follows below,
2. reading multi-image files sequentially via an iterable dataset powered by WebDataset. This seems to be state of the art, yet the documentation is a bit sparse.
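For concreteness, option 1 would look roughly like this (a minimal sketch; the bucket, keys and labels are placeholders):

import os

import boto3
from PIL import Image
from torch.utils.data import Dataset


class S3ImageDataset(Dataset):
    """Map-style dataset that fetches images from S3, with a local disk cache."""

    def __init__(self, bucket, keys, labels, cache_dir="/tmp/img_cache"):
        self.bucket = bucket
        self.keys = keys          # S3 object keys, one per image
        self.labels = labels      # label per key
        self.cache_dir = cache_dir
        self._s3 = None           # created lazily so the dataset stays picklable
        os.makedirs(cache_dir, exist_ok=True)

    def _client(self):
        # boto3 clients are not picklable, so build one per worker process.
        if self._s3 is None:
            self._s3 = boto3.client("s3")
        return self._s3

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, idx):
        key = self.keys[idx]
        local_path = os.path.join(self.cache_dir, key.replace("/", "_"))
        if not os.path.exists(local_path):
            # Epoch 1: download from S3 and populate the local cache.
            self._client().download_file(self.bucket, key, local_path)
        # Epoch 2 onwards reads straight from local disk.
        image = Image.open(local_path).convert("RGB")
        # (apply transforms / ToTensor here)
        return image, self.labels[idx]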
I’m satisfied with neither: option 1 is transparent and easy, but verbose. Option 2 is performant but more opaque (and requires an additional step to package the images into tar archives).
I’m wondering: can Ray Datasets save my day here and power efficient data loading of large pictures from S3 into a PyTorch script? How?
Hi @Lacruche, Datasets should work great here! Datasets provides an API for reading binary files such as imagery, and has a .to_torch() API that yields a familiar PyTorch IterableDataset that you can consume directly within your PyTorch trainer:
import ray
import torch

# to_image() / to_label() stand in for your own decoding logic that turns
# the raw bytes into a tensor and a label.
ds_pipe = ray.data.read_binary_files("s3://some/bucket") \
    .map(lambda bytes: {"data": to_image(bytes), "label": to_label(bytes)}) \
    .repeat(num_epochs)

for epoch, ds in enumerate(ds_pipe.iter_epochs()):
    torch_ds: torch.utils.data.IterableDataset = ds.to_torch(...)
    for batch_idx, data in enumerate(torch_ds):
        # Get the inputs; data is a list of [inputs, labels].
        inputs, labels = data

        # Zero the parameter gradients.
        optimizer.zero_grad()

        # Forward + backward + optimize.
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
The imagery will be read into and held in distributed memory, and will be fed to your trainer(s) with options for prefetching, batching, etc.
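For example, the to_torch(...) call above could hypothetically be filled in along these lines (keyword arguments here follow the Ray 1.x Datasets API; check the to_torch() signature of the Ray version you are running):

torch_ds = ds.to_torch(
    label_column="label",      # the "label" key produced by the map() step
    feature_columns=["data"],  # the decoded image data
    batch_size=32,             # size of the batches yielded to the training loop
    prefetch_blocks=2,         # fetch upcoming blocks while the current one trains
)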
Hey @Lacruche. Another alternative you could consider is Activeloop Hub. It’s an OSS format optimized for computer vision data, and its primary benefit is the ability to stream data while training models. The Python API is also very simple, and you can continue to store your data in your S3 bucket.
Full disclosure, I work for Activeloop, but we’d love it if you gave it a shot.
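A minimal sketch of what streaming from S3 into PyTorch might look like with Hub (assuming the Hub 2.x API; the path and tensor names below are made up, and the exact calls may differ between versions):

import hub

# Made-up S3 path; point this at the Hub-formatted dataset created from
# your images (AWS credentials are taken from your environment).
ds = hub.load("s3://my-bucket/my-hub-dataset")

# ds.pytorch() returns a torch DataLoader that streams samples from S3
# while the model trains.
train_loader = ds.pytorch(batch_size=8, shuffle=True, num_workers=4)

for batch in train_loader:
    images, labels = batch["images"], batch["labels"]
    # ...usual forward/backward pass...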