Could not find best trial. Did you pass the correct `metric` parameter?

Hi,

I am new to Ray Tune, but I cannot figure out my mistake.
I am using PyTorch Lightning with the TuneReportCallback in order to pass the loss to Tune. However, Tune is not picking it up. Whatever I try, I cannot get it to work. The run itself finishes fine, but at the end the following is displayed every time:
Could not find best trial. Did you pass the correct metric parameter?

I stumbled upon this topic: Could not find best trial
I tried to replicate its suggestion by adding a manual tune.report(...) call in the training loop, but that did not help either. What am I doing wrong?

EDIT: I did some more digging, and it seems that ‘progress.csv’ stays empty; maybe that has something to do with it? Also, in TensorBoard there is no ‘ray’ sub-header, whereas in the Lightning MNIST example there is one.

The relevant code (I think) is as follows:
Trainer:

class LitTrainer(pl.LightningModule):
(...)
    def validation_epoch_end(self, outputs):
        avg_loss = torch.mean(torch.stack(outputs))
        # "val_loss" is the key that TuneReportCallback maps to "loss" below
        self.log("val_loss", avg_loss, sync_dist=True)
        # tune.report(loss=avg_loss)  # the manual reporting I also tried

In main file:

def train_tune(config, args):
(...)
    logger = TensorBoardLogger(save_dir=tune.get_trial_dir(), name="", version=".")
    model = LitTrainer(netG=generator, config=config, args=args)

    trainer = pl.Trainer(
        gpus=args.gpus,
        max_epochs=args.max_epochs,
        logger=logger,
        log_every_n_steps=100,
        strategy='ddp_spawn',
        precision=args.precision,
        callbacks=[
            TuneReportCallback(
                {
                    "loss": "val_loss",
                },
                on="validation_end"),
        ],
        enable_progress_bar=False,
    )

    trainer.fit(
        model,
    )

def main():
    config = {
        'learning_rate': tune.loguniform(1e-4, 1e-2),
    }

    scheduler = ASHAScheduler(
        max_t=args.max_epochs,
        grace_period=1,
        reduction_factor=2)

    reporter = CLIReporter(
        parameter_columns=["learning_rate"],
        metric_columns=["loss", "training_iteration"])

    resources_per_trial = {'cpu': 24, 'gpu': args.gpus}

    train_fn_with_parameters = tune.with_parameters(
        train_tune,
        args=args
        )

    analysis = tune.run(
        train_fn_with_parameters,
        resources_per_trial=resources_per_trial,
        metric="loss",
        mode="min",
        config=config,
        num_samples=args.num_samples,
        scheduler=scheduler,
        progress_reporter=reporter,
        name='tune_test',
    )

    print('Best hyperparameters found were: ', analysis.best_config)

So what am I doing wrong?

Hey @rienboonstoppel ,

Based on the topic you linked, the issue could be that you’re reporting a Tensor:

avg_loss = torch.mean(torch.stack(outputs))

Could you try the suggested solution of converting this to a float?
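For example, something like this (a minimal sketch; only the logged value changes):

    def validation_epoch_end(self, outputs):
        avg_loss = torch.mean(torch.stack(outputs))
        # .item() converts the 0-dim tensor to a plain Python float
        self.log("val_loss", avg_loss.item(), sync_dist=True)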

Hi @matthewdeng,
No, the problem is not that it is a Tensor. I tried the things that were mentioned in the linked topic, but that did not solve it.
I just figured it out. Apparently Ray Tune cannot cope with multiple GPUs per trial: as soon as I increase the number of GPUs, it no longer receives the metrics. As strategy I tried both ddp and ddp_spawn, but that made no difference, presumably because the spawned worker processes never report back to the Tune session. So that is a bit of a bummer. Instead of training one model on multiple GPUs at a time, I will train multiple models in parallel, each on a single GPU. Total runtime should be quite similar, I think.
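Concretely, I plan to change roughly the following (just the GPU-related settings; everything else stays as in the code above, and the CPU count is only an example):

    # in main(): one GPU per trial, so several trials can run concurrently
    # (the CPU count probably needs lowering too so trials actually fit side by side)
    resources_per_trial = {'cpu': 6, 'gpu': 1}

    # in train_tune(): single-device Trainer, no DDP strategy needed
    trainer = pl.Trainer(
        gpus=1,
        max_epochs=args.max_epochs,
        logger=logger,
        log_every_n_steps=100,
        precision=args.precision,
        callbacks=[TuneReportCallback({"loss": "val_loss"}, on="validation_end")],
        enable_progress_bar=False,
    )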

Ah, I see. Can you try using the PyTorch Lightning RayPlugin?
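Roughly something like this (a sketch I haven't run against your code; the exact arguments may differ depending on your ray_lightning version):

    from ray_lightning import RayPlugin

    trainer = pl.Trainer(
        max_epochs=args.max_epochs,
        logger=logger,
        # replaces strategy='ddp_spawn' and gpus=...: the plugin launches the
        # distributed training workers as Ray actors, so Tune can still collect metrics
        plugins=[RayPlugin(num_workers=args.gpus, use_gpu=True)],
        callbacks=[TuneReportCallback({"loss": "val_loss"}, on="validation_end")],
        enable_progress_bar=False,
    )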