Aggregation of distributed metrics

When training a model with PyTorch-DDP on multiple workers with Ray Train, each worker holds its own value for a metric (e.g., the accuracy). Is there a way to aggregate these metrics inside the training function, similar to what is possible with all_reduce in PyTorch?

Hey @caesar025,

You should still be able to use all_reduce directly in your training function to compute aggregate metrics on the distributed workers (as you normally would in a PyTorch-DDP script)!
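For illustration, here's a minimal sketch of what that can look like. It assumes the current `TorchTrainer` / `ray.train.report` API, and the `local_correct` / `local_total` values are placeholders for whatever your validation loop actually computes:

```python
import torch
import torch.distributed as dist

import ray.train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_func(config):
    # ... build model, data loaders, and run your usual DDP training loop ...

    # Suppose each worker has computed its own correct/total counts
    # on its shard of the validation data (placeholder values here).
    local_correct = 42.0
    local_total = 50.0

    # all_reduce sums the tensor across all workers, using the same
    # process group that Ray Train initializes for PyTorch-DDP.
    stats = torch.tensor([local_correct, local_total])
    dist.all_reduce(stats, op=dist.ReduceOp.SUM)

    global_accuracy = (stats[0] / stats[1]).item()

    # Every worker now holds the same aggregated value and reports it.
    ray.train.report({"accuracy": global_accuracy})


trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()
```

Summing the raw counts and dividing afterwards (rather than all-reducing per-worker accuracies) keeps the aggregate correct even when workers see differently sized data shards.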

In the future, Ray Train will expose an API that allows you to perform the aggregation of reported metrics on the Trainer side (for Callbacks).
