When training a model with PyTorch DDP on multiple workers with Ray Train, each worker holds its own value for a metric (e.g. the accuracy). Is there a way to aggregate these metrics inside the training function, similar to what is possible with all_reduce in PyTorch?
Hey @caesar025,
You should still be able to use all_reduce directly in your training function to compute aggregate metrics across the distributed workers, just as you would in a regular PyTorch-DDP script!
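Here is a minimal sketch of what that can look like inside the training function. It assumes Ray Train has already set up the torch.distributed process group for the workers (as it does for DDP training); compute_local_accuracy is a hypothetical placeholder for however you compute the per-worker metric, and the reporting call at the end depends on your Ray Train version.

```python
import torch
import torch.distributed as dist


def train_func(config):
    # ... model setup, DDP wrapping, and the training loop go here ...

    # Each worker computes its own value of the metric.
    # compute_local_accuracy() is a hypothetical placeholder.
    local_accuracy = compute_local_accuracy()

    # Sum the metric across all workers, then divide by the world size
    # to get the mean. After all_reduce, every worker holds the same value.
    # Note: with the NCCL backend, move the tensor to the worker's GPU first.
    acc = torch.tensor([local_accuracy], dtype=torch.float32)
    dist.all_reduce(acc, op=dist.ReduceOp.SUM)
    global_accuracy = acc.item() / dist.get_world_size()

    # Report the aggregated metric; the exact call depends on your Ray
    # Train version, e.g. ray.train.report({"accuracy": global_accuracy}).
    return global_accuracy
```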
In the future, Ray Train will expose an API that allows you to perform the aggregation of reported metrics on the Trainer side (for Callbacks).