Aggregation of distributed metrics

When training a model with PyTorch-DDP on multiple workers with Ray Train, each worker holds its own value for a metric (e.g., the accuracy). Is there a way to aggregate these metrics inside the training function, similar to what is possible with all_reduce in PyTorch?

Hey @caesar025,

You should still be able to use all_reduce directly in your training function to compute aggregate metrics on the distributed workers (as you normally would in a PyTorch-DDP script)!
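For illustration, here's a minimal sketch of what that can look like. It assumes the current `TorchTrainer` / `ray.train.report` API, and the `local_correct` / `local_total` values are placeholders for whatever your validation loop actually computes:

```python
import torch
import torch.distributed as dist

import ray.train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_func(config):
    # ... build model, data loaders, and run your usual DDP training loop ...

    # Suppose each worker has computed its own correct/total counts
    # on its shard of the validation data (placeholder values here).
    local_correct = 42.0
    local_total = 50.0

    # all_reduce sums the tensor across all workers, using the same
    # process group that Ray Train initializes for PyTorch-DDP.
    stats = torch.tensor([local_correct, local_total])
    dist.all_reduce(stats, op=dist.ReduceOp.SUM)

    global_accuracy = (stats[0] / stats[1]).item()

    # Every worker now holds the same aggregated value and reports it.
    ray.train.report({"accuracy": global_accuracy})


trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()
```

Summing the raw counts and dividing afterwards (rather than all-reducing per-worker accuracies) keeps the aggregate correct even when workers see differently sized data shards.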

In the future, Ray Train will expose an API that allows you to perform the aggregation of reported metrics on the Trainer side (for Callbacks).
