How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I’m using Ray Tune to tune a PyTorch model. I’ve set up my tuning as follows:
from functools import partial

import numpy as np
from ray import tune
from ray.tune import CLIReporter
from ray.tune.schedulers import ASHAScheduler

config = {
    "l1": tune.sample_from(lambda _: 2 ** np.random.randint(5, 8)),
    "l2": tune.sample_from(lambda _: 2 ** np.random.randint(5, 8)),
}
scheduler = ASHAScheduler(max_t=10)
reporter = CLIReporter(metric_columns=["loss"])
result = tune.run(
    partial(
        run_training,
        n_fp=n_fp,
        train_loader=train_loader,
        test_loader=test_loader,
        device=device,
        EPOCHS=1,
        tuning=True,
    ),
    resources_per_trial={"cpu": 1, "gpu": 1},
    config=config,
    metric="loss",
    mode="min",
    num_samples=2,
    scheduler=scheduler,
    progress_reporter=reporter,
)
In my run_training() function I call tune.report():
tune.report(loss=avg_val_loss.cpu().detach().numpy())
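For context, here is a stripped-down sketch of how run_training is laid out. The build_model, train_one_epoch, and validate helpers are placeholder names standing in for my real training code, not the actual implementation:

# Sketch only: build_model / train_one_epoch / validate are placeholders.
def run_training(config, n_fp, train_loader, test_loader, device, EPOCHS, tuning):
    # `config` is injected by Tune; the remaining arguments come from functools.partial above.
    model = build_model(l1=config["l1"], l2=config["l2"], n_fp=n_fp).to(device)
    for epoch in range(EPOCHS):
        train_one_epoch(model, train_loader, device)
        avg_val_loss = validate(model, test_loader, device)  # scalar torch tensor
        if tuning:
            # One report per epoch; this is the call shown above.
            tune.report(loss=avg_val_loss.cpu().detach().numpy())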
To get the best trial, I’m using:
best = result.get_best_trial(metric="loss", mode="min", scope="last")
I’ve made sure that my reported loss is not a tensor, as suggested in a previous post.
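(Concretely, the value passed to tune.report() above is a 0-dimensional NumPy array rather than a torch tensor. If it matters, the plain-Python-float variant I could switch to would look like this, assuming avg_val_loss is a scalar tensor:)

# Alternative sketch: report a plain float instead of a 0-dim NumPy array.
tune.report(loss=float(avg_val_loss.item()))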
I’m still getting the error:
Could not find best trial. Did you pass the correct `metric` parameter?
The weird part is that when results print during the trials, I do see my loss being reported:
Result for run_training_3516f_00001:
date: 2022-06-08_14-24-31
done: true
experiment_id: 9f4d5e3284d348a980987a33f0b76416
iterations_since_restore: 1
loss: 0.015924770385026932
pid: 406953
time_since_restore: 3.5238189697265625
time_this_iter_s: 3.5238189697265625
time_total_s: 3.5238189697265625
timestamp: 1654712671
timesteps_since_restore: 0
training_iteration: 1
trial_id: 3516f_00001
warmup_time: 0.003016948699951172
During reporting I even see:
Current best trial: 4dcc3_00000 with loss=0.009870701469480991 and parameters={'l1': 32, 'l2': 64}
Looking into the get_best_trial() function, I see that it references trial.metric_analysis[metric].
When I manually print the keys for each trial, the metric key "loss" is not there:
print([trial.metric_analysis.keys() for trial in result.trials])
-> dict_keys(['time_this_iter_s', 'done', 'training_iteration', 'time_total_s', 'time_since_restore', 'timesteps_since_restore', 'iterations_since_restore', 'warmup_time'])
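For comparison, the equivalent check against each trial's last reported result (which, as far as I can tell, the Trial object exposes as last_result) would be:

# Additional sanity check (sketch): inspect the last reported result dict per trial.
print([trial.last_result.keys() for trial in result.trials])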
However, when I call result._validate_metric("loss"), the result is 'loss', and I’ve verified that result.default_metric is also 'loss'.
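In code, those two checks were simply:

# Both come back as "loss", matching the metric passed to tune.run().
print(result._validate_metric("loss"))
print(result.default_metric)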
I don’t know why the metric I’ve configured isn’t showing up in the final analysis, which prevents me from finding the best trial, especially since a loss is clearly being reported throughout the trials.