How to report loss when using more than one worker?

1. Severity of the issue: (select one)
Low: Annoying but doesn’t hinder my work.

2. Environment:

  • Ray version: 2.45
  • Python version: 3.10
  • OS: Ubuntu 24.04
  • Cloud/Infrastructure: AWS

I’m exploring Ray and am very new to the framework.

I’m using TorchTrainer to train a model and would like to use 4 GPUs to speed up training. I’m using MLflow for experiment tracking and for reporting the loss.

There is example code that shows how to use MLflow in Ray Train.

I’m wondering: if only the rank 0 worker’s loss is reported in MLflow, how do I calculate the overall loss? There are four independent workers, each with a different loss.

Thanks
MV

Hello! I looked at the example you mentioned, and it does indeed log the loss only from the rank 0 worker:

    if ray.train.get_context().get_world_rank() == 0:
        mlflow.log_metrics({"loss": loss.item(), "epoch": epoch})

So I think you are correct: the loss.item() here is the loss calculated by the rank 0 worker on its own batch of data.

If you remove the if statement that checks for the rank 0 worker, each worker should start reporting its own loss (see the sketch below). Let me know if that helps!
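
For reference, a minimal sketch of the logging call without the rank check. Tagging the metric name with the worker rank is my own suggestion so the four series stay separate in MLflow; it assumes each worker has MLflow tracking configured, and that loss and epoch come from the surrounding training loop:

    import mlflow
    import ray.train

    # Every worker logs its own loss. Keying the metric by rank keeps the
    # per-worker series distinguishable in the MLflow UI.
    rank = ray.train.get_context().get_world_rank()
    mlflow.log_metrics({f"loss_rank_{rank}": loss.item(), "epoch": epoch})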

Hi Christina

Thank you for the prompt response.

Removing the if condition does start reporting the loss from all training workers.

I would like to combine the losses from all workers into a single value that can be reported to MLflow. Could you point me in the right direction, please?

Thank you.
MV
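
One common way to get a single value is to average the loss across workers with an all-reduce before logging, so every worker ends up holding the same aggregated number and only rank 0 sends it to MLflow. A minimal sketch, assuming the training loop from the example (loss and epoch come from the surrounding loop, and TorchTrainer has already initialized the process group):

    import torch.distributed as dist
    import mlflow
    import ray.train

    # Inside the per-worker training loop, after `loss` has been computed
    # for the local batch. TorchTrainer sets up the default process group,
    # so torch.distributed collectives are available here.
    loss_tensor = loss.detach().clone()

    # Sum the per-worker losses, then divide by the number of workers to
    # get the mean. SUM followed by a divide works on every backend.
    dist.all_reduce(loss_tensor, op=dist.ReduceOp.SUM)
    loss_tensor /= dist.get_world_size()

    # Every worker now holds the same averaged loss; log it once from rank 0.
    if ray.train.get_context().get_world_rank() == 0:
        mlflow.log_metrics({"loss": loss_tensor.item(), "epoch": epoch})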