Metric key is not in trial.metric_analysis -- Not able to find best_trial

How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I'm using Ray Tune to tune a PyTorch model. I've set up my tuning run as follows:

from functools import partial

import numpy as np
from ray import tune
from ray.tune import CLIReporter
from ray.tune.schedulers import ASHAScheduler

config = {
    "l1": tune.sample_from(lambda _: 2 ** np.random.randint(5, 8)),
    "l2": tune.sample_from(lambda _: 2 ** np.random.randint(5, 8)),
}
scheduler = ASHAScheduler(max_t=10)
reporter = CLIReporter(metric_columns=["loss"])
result = tune.run(
    partial(
        run_training,
        n_fp=n_fp,
        train_loader=train_loader,
        test_loader=test_loader,
        device=device,
        EPOCHS=1,
        tuning=True,
    ),
    resources_per_trial={"cpu": 1, "gpu": 1},
    config=config,
    metric="loss",
    mode="min",
    num_samples=2,
    scheduler=scheduler,
    progress_reporter=reporter,
)

In my run_training() function I've set up the tune.report() call:

tune.report(loss=avg_val_loss.cpu().detach().numpy())

To get the best trial I’m using

best = result.get_best_trial(metric="loss", mode="min", scope="last")

I've made sure that my reported loss is not a tensor, as suggested in a previous post.
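For what it's worth, here's the quick check I ran in isolation (a sketch with a stand-in value for my real loss tensor):

import torch

avg_val_loss = torch.tensor(0.0159)  # stand-in for my real validation loss
reported = avg_val_loss.cpu().detach().numpy()
print(type(reported))  # <class 'numpy.ndarray'> -- not a torch.Tensor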

I’m still getting the error

Could not find best trial. Did you pass the correct `metric` parameter?

The weird part is, when results print while the trials are running, I do see my loss being reported:

Result for run_training_3516f_00001:
  date: 2022-06-08_14-24-31
  done: true
  experiment_id: 9f4d5e3284d348a980987a33f0b76416
  iterations_since_restore: 1
  loss: 0.015924770385026932
  pid: 406953
  time_since_restore: 3.5238189697265625
  time_this_iter_s: 3.5238189697265625
  time_total_s: 3.5238189697265625
  timestamp: 1654712671
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 3516f_00001
  warmup_time: 0.003016948699951172

During reporting I even see:

Current best trial: 4dcc3_00000 with loss=0.009870701469480991 and parameters={'l1': 32, 'l2': 64}

Looking into the get_best_trial() function, I see that it references trial.metric_analysis[metric].
When I manually print the keys of that dict, the "loss" key is not there:

print([trial.metric_analysis.keys() for trial in result.trials])
-> dict_keys(['time_this_iter_s', 'done', 'training_iteration', 'time_total_s', 'time_since_restore', 'timesteps_since_restore', 'iterations_since_restore', 'warmup_time'])
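To rule out the raw results themselves, I also looked at each trial's last reported result dict directly (a sketch, assuming last_result holds the most recently reported result):

for trial in result.trials:
    value = trial.last_result.get("loss")
    print(trial.trial_id, value, type(value))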

However, when I call result._validate_metric("loss"), it returns 'loss', and I've verified that result.default_metric is also 'loss'.

I don't know why the metric I've configured isn't showing up in the final analysis and is preventing me from finding the best trial, especially since a loss is clearly being reported throughout the experiment.

@ajain300 thanks for posting this!

Do you mind sharing the full stack trace you are seeing? One reason you could be seeing this error is that not every single result contains the loss key. Can you verify that loss is being reported every time tune.report() is called? A sketch of one way to check is below.
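For example, you could log the value and its type right before reporting (a sketch based on your snippet; avg_val_loss is your variable):

value = avg_val_loss.cpu().detach().numpy()
print(f"reporting loss={value!r} (type={type(value).__name__})")
tune.report(loss=value)

# If the type turns out to be a 0-d numpy array rather than a plain float,
# converting with .item() would be worth trying (a hypothetical fix to rule out):
# tune.report(loss=avg_val_loss.item())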

Also, if you are able to share a small reproducible example, that would help a lot!
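For reference, something along these lines is usually enough as a starting point (a minimal, self-contained sketch, not your exact setup):

import numpy as np
from ray import tune

def trainable(config):
    # Report a plain Python float so the metric is unambiguously numeric.
    tune.report(loss=float(np.random.rand()))

result = tune.run(
    trainable,
    config={"l1": tune.choice([32, 64])},
    metric="loss",
    mode="min",
    num_samples=2,
)
print(result.get_best_trial(metric="loss", mode="min", scope="last"))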