How to get PyTorch losses from Ray Train?

Hi,

I’m training a PyTorch segmentation model with Ray Train on 4 NVIDIA T4 GPUs (an EC2 g4.12xlarge instance).

My validation section is the following snippet.

# val only on rank 0
if i > 0 and (i / float(args.log_freq)).is_integer() and train.local_rank() == 0:

    bstop = time.time()
    val_losses = []
    model.eval()
    with torch.no_grad():

        for j, batch in enumerate(val_loader):
            inputs, masks = batch
            inputs = inputs.to(device)
            masks = masks.to(device)
            outputs = model(inputs)
            val_loss = CE(outputs["out"], masks)
            val_losses.append(val_loss)

    avg_val_loss = torch.mean(torch.stack(val_losses))

    # print metrics
    throughput = float((i + 1) * batch_size) / (bstop - bstart)
    cluster_throughput = throughput * train.world_size()
    print(f'processed {i*batch_size} records in {bstop-bstart}s')
    print(
        "batch {}: Training_loss: {:.4f}, Val_loss: {:.4f}, Throughput: {}, "+
        "Approx cluster throughput: {}".format(
            i, loss, avg_val_loss, throughput, cluster_throughput
        )
    )

When I run it outside of Ray Train (as a plain PyTorch script), the losses print correctly.
When I run it with Ray Train, the loss values never show up in the printed output.

Is there something special I need to do so that PyTorch losses can be accessed and printed in a Ray Train training script?

I think this is because your format string has been split across two lines!

"batch {}: Training_loss: {:.4f}, Val_loss: {:.4f}, Throughput: {}, "+
        "Approx cluster throughput: {}".format("

Because the method call binds tighter than +, .format() is only being called on "Approx cluster throughput: {}", and the placeholders in the first string are left unfilled.
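For reference, here is a minimal sketch of how the print could be rewritten, using the same variables already in your snippet: either wrap the two string literals in parentheses so .format() applies to the full concatenated message, or switch to an f-string.

    # Option 1: parenthesize so .format() sees the whole message
    # (adjacent string literals inside the parentheses are concatenated first)
    print(
        (
            "batch {}: Training_loss: {:.4f}, Val_loss: {:.4f}, Throughput: {}, "
            "Approx cluster throughput: {}"
        ).format(i, loss, avg_val_loss, throughput, cluster_throughput)
    )

    # Option 2: an f-string avoids the concatenation pitfall entirely
    print(
        f"batch {i}: Training_loss: {loss:.4f}, Val_loss: {avg_val_loss:.4f}, "
        f"Throughput: {throughput}, Approx cluster throughput: {cluster_throughput}"
    )

Either form should print the losses the same way inside or outside Ray Train, since the problem is in the string formatting rather than in Ray Train itself.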
