Hi,
I’m training a PyTorch segmentation model with Ray Train on 4 NVIDIA T4s (an EC2 g4dn.12xlarge instance).
My validation code is the following snippet:
# run validation and log only on rank 0, every args.log_freq batches
if i > 0 and i % args.log_freq == 0 and train.local_rank() == 0:
    bstop = time.time()
    val_losses = []
    model.eval()
    with torch.no_grad():
        for j, batch in enumerate(val_loader):
            inputs, masks = batch
            inputs = inputs.to(device)
            masks = masks.to(device)
            outputs = model(inputs)
            val_loss = CE(outputs["out"], masks)
            val_losses.append(val_loss)
    avg_val_loss = torch.mean(torch.stack(val_losses))
    model.train()  # restore training mode after validation
    # print metrics (bstart is recorded before the training loop)
    throughput = float((i + 1) * batch_size) / (bstop - bstart)
    cluster_throughput = throughput * train.world_size()
    print(f'processed {i * batch_size} records in {bstop - bstart}s')
    print(
        "batch {}: Training_loss: {:.4f}, Val_loss: {:.4f}, Throughput: {}, "
        "Approx cluster throughput: {}".format(
            i, loss, avg_val_loss, throughput, cluster_throughput
        )
    )
When I run it outside of Ray Train (as a plain PyTorch script), the losses print correctly.
When I run it inside Ray Train, nothing shows up for the losses.
Is there something I need to do so that the PyTorch losses can be accessed and printed from a Ray Train training script?
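
For what it's worth, I also considered reporting the metrics instead of printing them. Below is a minimal sketch of what I had in mind, assuming train.report() (the keyword-argument form from the same ray.train module I'm already using for local_rank/world_size) is the intended way to surface per-worker metrics; the metric names are just placeholders I made up, and I'm not sure whether report has to be called from every worker or only from rank 0:

from ray import train

# Sketch only: replace the print calls above with a report call so the
# driver can collect the metrics from the worker instead of relying on stdout.
# (Assumption: keyword-argument form of train.report; names below are mine.)
if i > 0 and i % args.log_freq == 0 and train.local_rank() == 0:
    train.report(
        batch=i,
        train_loss=loss.item(),
        val_loss=avg_val_loss.item(),
        throughput=throughput,
        cluster_throughput=cluster_throughput,
    )

Does that look like the right direction, or should the print calls work as-is?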