Distributed training with uneven inputs

ben1 · October 12, 2023, 8:35am

I’m trying to replace torch.DDP with ray.train, and hit the problem of different DDP instance having different number of input batches.

DDP works with the ‘Join’ context manager can fix this problem, as shown below.
How to handle this problem using ray.train?

...
    model = DDP(torch.nn.Linear(1, 1).to(rank), device_ids=[rank])
    with model.join():
        for input in inputs:
            loss = model(input).sum()
            ...

https://pytorch.org/tutorials/advanced/generic_join.html#using-join-with-distributeddataparallel

matthewdeng · October 13, 2023, 11:36pm

Hey @ben1, Ray Train supports Torch DDP and you should be able to run this as well.

ben1 · October 17, 2023, 12:50pm

Yes, I find out that the actually blocking part is train.report.
Due to the uneven inputs, some worker finish the dataloader loop early. But train.report must be called the same number of times by all workers.

How to handle this problem using ray.train?

with model.join():
        iter_idx = 0
        for input in inputs:
            loss = model(input).sum()
            iter_idx += 1
            if iter_idx % report_num == 0:
                train.report(metrics)
            ...

justinvyu · October 26, 2023, 5:17pm

@ben1 It may be easier to just report one set of aggregated metrics at the end of the epoch, rather than every N steps within the loop.

for epoch in range(epochs):
    for input in inputs:  
        ...
    if epoch % report_num == 0:
        train.report(aggregated_metrics)

train.report does act as a barrier to maintain a consistent metric/checkpoint report iteration across workers.

Topic		Replies	Views
RuntimeError: Some workers returned results while others didn't. Make sure that `train.report()` and `train.checkpoint()` are called the same number of times on all workers Ray Train	1	684	April 16, 2022
Synchronizing workers during ray train Ray Train	8	871	February 25, 2025
Ray Trainer prepare_model gets stuck Ray Train	6	1023	June 6, 2022
How to check training and validation distributed properly on the ray cluster Ray Train	2	854	August 26, 2022
Aggregation of distributed metrics Ray Train	1	632	March 4, 2022

Distributed training with uneven inputs

Related topics