Converting wds.WebLoader for training

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I am attempting to convert a WebLoader into a Ray DataLoader in the open_clip repo. The WebLoader already has information about how to split data between workers, … so I’d like to keep that.

When applying the standard process (wrapping the data loader in ray.train.torch.prepare_data_loader), I get the following error:

  File ".../ray/train/torch/train_loop_utils.py", line 391, in prepare_data_loader
    and not isinstance(data_loader.sampler, DistributedSampler)
AttributeError: 'WebLoader' object has no attribute 'sampler'

So 2 questions here:

  • What’s the best way to convert a wds.WebLoader to a Ray-compliant data loader?
  • Less important: How can I carry over the parameters (such as split by worker)?

Thanks!

PS: I’m using the typical TorchTrainer. I see some folks using lower-level primitives like this. Is that the recommended approach when one wants to keep refactoring to a minimum?

I forwarded this question to the Ray Train team; they will be able to provide the most accurate answer regarding WebLoader and Ray Train.

An alternative is to use the Ray Data read_webdataset method and pass the resulting dataset to a Ray Trainer. Ray Train + Ray Data handles splitting the data across workers automatically, which you can read more about here. By combining read_webdataset() with a Ray Trainer, you may be able to achieve the same objective without relying on WebLoader.
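A minimal sketch of that alternative, assuming a recent Ray 2.x release. The helper names (consume_shard, build_trainer), the config keys, and the batch size are illustrative assumptions, not taken from open_clip or from the thread. The Ray imports sit inside the functions so that the pure-Python part can be exercised on its own.

```python
def consume_shard(shard, batch_size, num_epochs):
    """Iterate over this worker's shard and return the number of batches seen.

    `shard` is anything exposing iter_batches(batch_size=...), such as the
    iterator returned by ray.train.get_dataset_shard.
    """
    seen = 0
    for _ in range(num_epochs):
        for batch in shard.iter_batches(batch_size=batch_size):
            # `batch` maps column names to arrays; the forward and backward
            # passes of the real training step would go here.
            seen += 1
    return seen


def train_loop_per_worker(config):
    # Imported inside the function, which runs on each training worker.
    import ray.train

    # Ray Train hands every worker its own split of the "train" dataset.
    shard = ray.train.get_dataset_shard("train")
    seen = consume_shard(shard, config["batch_size"], config["num_epochs"])
    ray.train.report({"batches_seen": seen})


def build_trainer(shard_paths, num_workers=2):
    """Build a TorchTrainer fed by WebDataset shards (hypothetical helper)."""
    import ray.data
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    # shard_paths: path(s) to the .tar shards; supplied by the caller.
    ds = ray.data.read_webdataset(shard_paths)
    return TorchTrainer(
        train_loop_per_worker,
        train_loop_config={"batch_size": 32, "num_epochs": 1},
        scaling_config=ScalingConfig(num_workers=num_workers),
        datasets={"train": ds},
    )
```

Calling build_trainer(...).fit() with real shard paths would start the run. Decoding and augmentation that the existing WebLoader pipeline performs would have to be re-expressed on the Ray Data side, which this sketch does not attempt.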

Thanks. I actually had some help offline, and it turns out one doesn’t need to wrap the WebLoader with prepare_data_loader: it works out of the box.
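For code paths that may receive either kind of loader, the idea can be sketched as a small guard. This is an illustration only: prepare_if_supported and the "build_loader" config key are hypothetical names, and the check on the sampler attribute simply mirrors the attribute access shown in the traceback above.

```python
def prepare_if_supported(loader, prepare_fn):
    """Apply prepare_fn only to loaders that expose a `sampler` attribute.

    A loader without `sampler` (as the traceback shows for WebLoader) is
    returned unchanged, so its own per-worker splitting stays in effect.
    """
    if hasattr(loader, "sampler"):
        return prepare_fn(loader)
    return loader


def train_loop_per_worker(config):
    # Imported inside the function, which runs on each training worker.
    from ray.train.torch import prepare_data_loader

    # "build_loader" is a hypothetical factory passed in by the caller; it
    # would return either a torch DataLoader or a wds.WebLoader.
    loader = config["build_loader"]()
    loader = prepare_if_supported(loader, prepare_data_loader)
    for batch in loader:
        pass  # the real training step would go here
```

One thing to keep in mind: prepare_data_loader also moves batches to the worker's device, so a loader that skips it needs the training step to do that move itself.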

Thanks for investigating.