Synchronizing workers during Ray Train

Hi everyone! We are trying to use Ray Train for distributed training. We have a use case where each worker needs to be synchronized partway through training: run one function for some time, synchronize, and then run other functions. Is there a way to achieve this?

@pratkpranav I’m not quite clear what you mean by

  • each worker to be synchronized in between training
  • run a function for some time, synchronize, then run other functions

In what context are you using “synchronize” here? Perhaps a concrete example would help?

cc: @Yard1 @amogkam

@Jules_Damji Sorry if that was not very clear. I was wondering whether there is functionality similar to MPI_Barrier, which blocks every worker until all of them have reached that call and only then lets them continue from the next line. Something like this:

def training_loop_per_worker(config):
    model = config.get("model")
    dataloader = config.get("dataloader")
    loss_fn = config.get("loss_fn")
    optimizer = config.get("optimizer")

    model.train()
    for batch, (X, y) in enumerate(dataloader):
        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # <wait for all the workers to reach this line>

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Hi @pratkpranav,

The main API to interact with Ray Train from inside your custom training loop is session.report, and this actually does serve as a synchronization barrier for workers. Training will only progress once all workers have reported.

However, session.report is mainly used to report metrics and checkpoints at the end of each epoch. If you want to use it as a barrier, you’d have to report some dummy metrics. What’s the intended use case for this?
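For reference, here's a rough sketch of what that would look like inside the training loop. The metric name "barrier_step", the per-epoch structure, and config["num_epochs"] are just placeholders for illustration, not anything Ray requires:

from ray.air import session

def training_loop_per_worker(config):
    for epoch in range(config["num_epochs"]):
        # ... first phase of work for this epoch ...

        # Acts as a barrier: each worker blocks here until every worker
        # has called session.report for this iteration.
        session.report({"barrier_step": epoch})  # dummy metric, only for synchronization

        # ... second phase of work, now that all workers are in sync ...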

Synchronizing gradients between workers is already handled by the Torch distributed backend – just make sure you call ray.train.torch.prepare_model and ray.train.torch.prepare_data_loader on your model/dataloader. See here.
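For completeness, a minimal sketch of those wrappers in a training loop (the loss function, optimizer, and the model/dataloader coming from config are placeholder choices for illustration):

import torch
import ray.train.torch

def training_loop_per_worker(config):
    # prepare_model wraps the model in DistributedDataParallel and moves it
    # to this worker's device; prepare_data_loader adds a DistributedSampler
    # and moves each batch to the device.
    model = ray.train.torch.prepare_model(config["model"])
    dataloader = ray.train.torch.prepare_data_loader(config["dataloader"])

    loss_fn = torch.nn.CrossEntropyLoss()                      # placeholder
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)   # placeholder

    model.train()
    for X, y in dataloader:
        pred = model(X)
        loss = loss_fn(pred, y)

        optimizer.zero_grad()
        loss.backward()   # DDP synchronizes gradients across workers here
        optimizer.step()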

Hi @justinvyu ,

Thanks for your reply. We are building a machine learning engine and creating a distributed training framework for it using Ray Train. Unfortunately, our backend currently lacks the ability to pause model training before gradient synchronization, so I'm curious whether Ray Train can solve this. Since session.report is meant for reporting metrics and checkpoints at the end of each epoch, it may not be the most suitable approach here. Are there alternative ways to accomplish this?

@pratkpranav

Ray actually does provide some basic communication primitives (including barrier). Take a look here: Ray Collective Communication Lib — Ray 2.4.0

This blog post may also be of use: Introducing Collective Communication Primitive APIs in Ray | Anyscale

I believe this is not under active development at the moment, but let me know if this suits your needs and if you run into any difficulties.
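For what it's worth, here is a minimal sketch of the barrier primitive with plain Ray actors, assuming the gloo backend is available (nccl is the other option, for GPU workers); the group name and world size are arbitrary:

import ray
import ray.util.collective as col

ray.init()

@ray.remote
class Worker:
    def run(self, world_size, rank):
        # Every worker joins the same collective group.
        col.init_collective_group(world_size, rank, backend="gloo", group_name="sync")

        # ... first phase of work ...

        # Block until all workers in the group have reached this call.
        col.barrier(group_name="sync")

        # ... second phase of work ...
        return rank

workers = [Worker.remote() for _ in range(4)]
ray.get([w.run.remote(4, rank) for rank, w in enumerate(workers)])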

You can also consider using the communication backend of torch.distributed: Distributed communication package - torch.distributed — PyTorch 2.0 documentation
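Since Ray Train's Torch backend already initializes the torch.distributed process group on each worker, something along these lines should work directly inside your training loop (a sketch; the model/dataloader/optimizer setup is omitted):

import torch.distributed as dist

def training_loop_per_worker(config):
    # model, dataloader, loss_fn, optimizer set up as before
    for batch, (X, y) in enumerate(dataloader):
        pred = model(X)
        loss = loss_fn(pred, y)

        # Block every worker here until all of them have reached this call.
        dist.barrier()

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()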

Thanks @justinvyu for the responses.

Thanks @justinvyu, @Jules_Damji! This is really helpful.