Distributed training in PyTorch and init_process_group

Hey @vblagoje, the reason for this behavior is that under the hood, each iteration waits for all processes to report metrics via tune.report before training continues, even though ultimately only the metrics from the rank 0 worker are propagated up. You can think of this as a way to keep all processes synchronized.

Would you be able to invoke tune.report on all workers?
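As a rough illustration (not your exact code), here's a minimal sketch of a per-worker training function that calls tune.report on every rank. It assumes the older keyword-style `tune.report(**metrics)` API and that whatever launches the workers (e.g. a distributed trainable wrapper) has already set the MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE environment variables that `init_process_group` with the default env:// method needs. The names `train_fn`, `config["lr"]`, and `config["epochs"]` are just placeholders:

```python
import torch
import torch.distributed as dist
from ray import tune


def train_fn(config):
    # Every worker joins the same process group. Assumes MASTER_ADDR,
    # MASTER_PORT, RANK, and WORLD_SIZE are already set in the environment.
    dist.init_process_group(backend="gloo")

    model = torch.nn.Linear(10, 1)
    ddp_model = torch.nn.parallel.DistributedDataParallel(model)
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=config["lr"])

    for epoch in range(config["epochs"]):
        # Stand-in for one epoch over this worker's shard of the data.
        inputs = torch.randn(32, 10)
        targets = torch.randn(32, 1)
        loss = torch.nn.functional.mse_loss(ddp_model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # The important part: *every* rank calls tune.report each iteration,
        # not just rank 0. Tune blocks until all workers have reported and
        # then propagates only rank 0's metrics.
        tune.report(loss=loss.item(), epoch=epoch)
```

With that in place, each iteration unblocks as soon as all ranks have reported, so no worker is left waiting on the others.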