How to get PyTorch losses from Ray Train?

Hi,

I’m training a PyTorch segmentation model with Ray Train on 4 NVIDIA T4 GPUs (an EC2 g4.12xlarge instance).

My validation section is the following snippet.

# val only on rank 0
if i > 0 and (i / float(args.log_freq)).is_integer() and train.local_rank() == 0:

    bstop = time.time()
    val_losses = []
    model.eval()
    with torch.no_grad():

        for j, batch in enumerate(val_loader):
            inputs, masks = batch
            inputs = inputs.to(device)
            masks = masks.to(device)
            outputs = model(inputs)
            val_loss = CE(outputs["out"], masks)
            val_losses.append(val_loss)

    avg_val_loss = torch.mean(torch.stack(val_losses))

    # print metrics
    throughput = float((i + 1) * batch_size) / (bstop - bstart)
    cluster_throughput = throughput * train.world_size()
    print(f'processed {i*batch_size} records in {bstop-bstart}s')
    print(
        "batch {}: Training_loss: {:.4f}, Val_loss: {:.4f}, Throughput: {}, "+
        "Approx cluster throughput: {}".format(
            i, loss, avg_val_loss, throughput, cluster_throughput
        )
    )

When I run it outside of Ray Train (as a plain PyTorch script), the losses print correctly.
When I run it with Ray Train, the loss values never show up in the printed output.

Is there something special I need to do so that PyTorch losses can be accessed and printed in a Ray Train training script?

I think this is because your format string has been split across two lines!

"batch {}: Training_loss: {:.4f}, Val_loss: {:.4f}, Throughput: {}, "+
        "Approx cluster throughput: {}".format("

Because the method call binds tighter than +, .format() is only being called on "Approx cluster throughput: {}", and the placeholders in the first string are left unfilled.
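For reference, here is a minimal sketch of how the print could be rewritten, using the same variables already in your snippet: either wrap the two string literals in parentheses so .format() applies to the full concatenated message, or switch to an f-string.

    # Option 1: parenthesize so .format() sees the whole message
    # (adjacent string literals inside the parentheses are concatenated first)
    print(
        (
            "batch {}: Training_loss: {:.4f}, Val_loss: {:.4f}, Throughput: {}, "
            "Approx cluster throughput: {}"
        ).format(i, loss, avg_val_loss, throughput, cluster_throughput)
    )

    # Option 2: an f-string avoids the concatenation pitfall entirely
    print(
        f"batch {i}: Training_loss: {loss:.4f}, Val_loss: {avg_val_loss:.4f}, "
        f"Throughput: {throughput}, Approx cluster throughput: {cluster_throughput}"
    )

Either form should print the losses the same way inside or outside Ray Train, since the problem is in the string formatting rather than in Ray Train itself.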
