Calculating a single metric value for a dataset

I’ve just managed to get Ray Train working, and now I want to track overall progress in the report.

I’m using torchmetrics to calculate the MSE over a whole dataset. How can I produce a single value for a training epoch? I have to train in batches, so before using Ray I saved all outputs and labels into lists and concatenated them before calculating the metric, e.g.:

```python
import torch
import torchmetrics

# model and train_loader are defined elsewhere in the training function
all_outputs = []
all_labels = []

for x, labels in train_loader:
    outputs = model(x, True)
    # network has a single output, so drop that dimension to match labels;
    # squeeze(-1) keeps the batch dimension even when a batch has one sample
    outputs = outputs.squeeze(-1)

    all_outputs.append(outputs.detach())
    all_labels.append(labels)

all_outputs = torch.cat(all_outputs)
all_labels = torch.cat(all_labels)

mse = torchmetrics.functional.mean_squared_error(all_outputs, all_labels).item()
```

I have 2 workers running, so at the moment this gives me two MSE values in the report, one per worker. Is there any way to get a single value for the entire dataset instead?

How are you reporting/saving your MSE metrics?

Ray Train natively supports torchmetrics, so if you use train.report(), you should be able to get both values and access them in a callback (see e.g. here: Ray Train User Guide (Ray 1.12.1)). You could then aggregate them in the callback for further processing, if desired.
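When aggregating per-worker MSEs, a plain average of the two values is only correct if both workers saw the same number of samples. A minimal sketch, assuming (hypothetically) that each worker calls `train.report(mse=..., num_samples=...)` so every reported dict carries those two keys: converting each MSE back to a sum of squared errors and dividing by the total count gives the dataset-level MSE. In Ray 1.x, a callback's `handle_result` receives a list with one reported dict per worker (check the linked guide for the exact signature); `combine_worker_mse` is a hypothetical helper you could call from there.

```python
from typing import Dict, List


def combine_worker_mse(results: List[Dict]) -> float:
    """Combine per-worker MSE values into one dataset-level MSE.

    Assumes each dict has hypothetical "mse" and "num_samples" keys.
    Weighting by sample count keeps the result correct even when workers
    process different numbers of samples.
    """
    # mse * n recovers each worker's sum of squared errors
    total_sse = sum(r["mse"] * r["num_samples"] for r in results)
    total_n = sum(r["num_samples"] for r in results)
    return total_sse / total_n


# Two workers with uneven shards: the weighted result is 2.5,
# whereas a plain average of 2.0 and 4.0 would give 3.0.
print(combine_worker_mse([{"mse": 2.0, "num_samples": 30},
                          {"mse": 4.0, "num_samples": 10}]))  # 2.5
```

If each worker instead reports a value that torchmetrics has already synced across processes, all entries in `results` should be identical and any one of them can be used directly.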