In each train_loop_per_worker, the loss is local. How can I report the global loss?
Hi,
To get the global loss in a distributed PyTorch training setup, you can use collective communication operations like torch.distributed.all_reduce to aggregate the local losses from all workers. Here is a brief outline:
- Calculate the local loss on each worker.
- Use torch.distributed.all_reduce to sum up the local losses.
- Divide the aggregated loss by the number of workers to get the global loss.
Here is a simplified example:
import torch
import torch.distributed as dist

def train_loop_per_worker():
    # ... your training code ...
    # local_loss must be a torch.Tensor on this worker's device
    local_loss = compute_local_loss()  # Replace with your loss computation
    # Sum the local losses across all workers (all_reduce modifies the tensor in place on every worker)
    dist.all_reduce(local_loss, op=dist.ReduceOp.SUM)
    # Divide by the number of workers to get the average (global) loss
    global_loss = local_loss / dist.get_world_size()
    # Report the global loss
    return global_loss.item()
Ensure that your training script initializes the distributed process group appropriately using dist.init_process_group().
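For a plain PyTorch launch (i.e., when no framework sets up the process group for you), here is a minimal sketch of that initialization, assuming the script is started with torchrun so that RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT are already in the environment; setup_distributed and the elided model code are illustrative, not from this thread:

import os
import torch
import torch.distributed as dist

def setup_distributed():
    # With torchrun, the env:// init method reads rank/world size/master address
    # from the environment variables exported by the launcher.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend, init_method="env://")
    if torch.cuda.is_available():
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Launched e.g. with: torchrun --nproc_per_node=4 train.py
if __name__ == "__main__":
    setup_distributed()
    # ... build the model, wrap it in DistributedDataParallel, run train_loop_per_worker() ...
    dist.destroy_process_group()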
Thanks
Why doesn’t Ray Train provide this, since it is such a common case in distributed training? Just curious.
Hi,
Since Ray wraps the PyTorch setup for me, how do I use dist.init_process_group() in the Ray context?
Regards
You can also use existing Torch ecosystem libraries that do this for you, e.g. torchmetrics.
See an example here.
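As a rough illustration, here is a minimal sketch of aggregating the per-worker loss with torchmetrics' MeanMetric, which syncs its state across workers on compute() as long as the torch.distributed process group is initialized; get_dataloader and compute_local_loss are hypothetical placeholders:

import torch
from torchmetrics import MeanMetric

def train_loop_per_worker():
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # MeanMetric keeps a running average of everything passed to update()
    loss_metric = MeanMetric().to(device)

    for batch in get_dataloader():            # hypothetical data loader
        loss = compute_local_loss(batch)      # hypothetical per-batch loss tensor
        loss_metric.update(loss.detach())

    # compute() all-reduces the internal state, so every worker sees the same global mean
    global_loss = loss_metric.compute().item()
    loss_metric.reset()
    return global_loss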