How to report loss when using more than one worker?

1. Severity of the issue: (select one)
Low: Annoying but doesn’t hinder my work.

2. Environment:

  • Ray version: 2.45
  • Python version: 3.10
  • OS: Ubuntu 24.04
  • Cloud/Infrastructure: AWS

I’m exploring Ray and am very new to the framework.

I’m using TorchTrainer to train a model and would like to use 4 GPUs to speed up training. I’m using MLflow for experiment tracking and for reporting the loss.

There is example code that shows how to use MLflow in Ray Train.

I’m wondering: if only the rank 0 worker’s loss is reported in MLflow, how do I calculate the overall loss? There are four independent workers, each with a different loss.

Thanks
MV

Hello! I looked at the example you mentioned, and it does indeed log the loss only from the rank 0 worker:

    if ray.train.get_context().get_world_rank() == 0:
        mlflow.log_metrics({"loss": loss.item(), "epoch": epoch})

So I think you are correct: the loss.item() here is the loss calculated by the rank 0 worker on its own batch of data.

If you remove the if statement that checks for the rank 0 worker, each worker should start reporting its own loss (see the sketch below). Let me know if that helps!
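
For reference, a minimal sketch of the logging call without the rank check. Tagging the metric name with the worker rank is my own suggestion so the four series stay separate in MLflow; it assumes each worker has MLflow tracking configured, and that loss and epoch come from the surrounding training loop:

    import mlflow
    import ray.train

    # Every worker logs its own loss. Keying the metric by rank keeps the
    # per-worker series distinguishable in the MLflow UI.
    rank = ray.train.get_context().get_world_rank()
    mlflow.log_metrics({f"loss_rank_{rank}": loss.item(), "epoch": epoch})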

Hi Christina

Thank you for the prompt response.

Removing the if condition does start reporting the loss from all training workers.

I would like to combine the losses from all workers into a single value that can be reported to MLflow. Could you point me in the right direction, please?

Thank you.
MV
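
One common way to get a single value is to average the loss across workers with an all-reduce before logging, so every worker ends up holding the same aggregated number and only rank 0 sends it to MLflow. A minimal sketch, assuming the training loop from the example (loss and epoch come from the surrounding loop, and TorchTrainer has already initialized the process group):

    import torch.distributed as dist
    import mlflow
    import ray.train

    # Inside the per-worker training loop, after `loss` has been computed
    # for the local batch. TorchTrainer sets up the default process group,
    # so torch.distributed collectives are available here.
    loss_tensor = loss.detach().clone()

    # Sum the per-worker losses, then divide by the number of workers to
    # get the mean. SUM followed by a divide works on every backend.
    dist.all_reduce(loss_tensor, op=dist.ReduceOp.SUM)
    loss_tensor /= dist.get_world_size()

    # Every worker now holds the same averaged loss; log it once from rank 0.
    if ray.train.get_context().get_world_rank() == 0:
        mlflow.log_metrics({"loss": loss_tensor.item(), "epoch": epoch})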