Hi,
I am new to Ray Tune and cannot figure out my mistake.
I am using PyTorch Lightning with the TuneReportCallback to pass the loss to Tune. However, Tune does not pick it up, whatever I try. The run itself completes fine, but at the end the following is always displayed:
Could not find best trial. Did you pass the correct metric parameter?
I stumbled upon this topic: Could not find best trial
I tried to replicate that by manually adding a tune.report(...) call in the training loop, but that didn’t do anything either. What am I doing wrong?
EDIT: I did some more digging, and it seems that ‘progress.csv’ stays empty. Maybe that has something to do with it? Also, in TensorBoard there is no ‘ray’ sub-header, whereas in the Lightning-MNIST example there is.
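To make the ‘progress.csv’ observation easier to reproduce, here is a rough sketch of how it could be checked. It assumes that Tune writes one progress.csv per trial directory and that reported metric names appear as header columns. The function name summarize_progress_files is hypothetical and not part of Ray; it only uses the Python standard library.

```python
import csv
from pathlib import Path


def summarize_progress_files(experiment_dir, metric="loss"):
    """Return {path: (row_count, has_metric)} for every progress.csv found.

    Hypothetical helper, not part of Ray Tune. It assumes each trial
    directory below `experiment_dir` contains a progress.csv whose header
    row lists the reported metric names.
    """
    summary = {}
    for csv_path in sorted(Path(experiment_dir).rglob("progress.csv")):
        with open(csv_path, newline="") as handle:
            reader = csv.DictReader(handle)
            rows = list(reader)
            # fieldnames is None when the file is completely empty
            columns = reader.fieldnames or []
        summary[str(csv_path)] = (len(rows), metric in columns)
    return summary
```

Calling summarize_progress_files on the experiment directory (for example the 'tune_test' folder inside the results directory) should give a row count of 0 and a False flag for every trial if nothing was ever reported under the name "loss".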
The relevant code (I think) I use is as follows:
Trainer:
class LitTrainer(pl.LightningModule):
    (...)
    def validation_epoch_end(self, outputs):
        avg_loss = torch.mean(torch.stack(outputs))
        self.log("val_loss", avg_loss, sync_dist=True)
        # tune.report(loss=avg_loss)
In main file:
def train_tune(config, args):
    (...)
    logger = TensorBoardLogger(save_dir=tune.get_trial_dir(), name="", version=".")
    model = LitTrainer(netG=generator, config=config, args=args)
    trainer = pl.Trainer(
        gpus=args.gpus,
        max_epochs=args.max_epochs,
        logger=logger,
        log_every_n_steps=100,
        strategy='ddp_spawn',
        precision=args.precision,
        callbacks=[
            TuneReportCallback(
                {
                    "loss": "val_loss",
                },
                on="validation_end"),
        ],
        enable_progress_bar=False,
    )
    trainer.fit(
        model,
    )
def main():
    config = {
        'learning_rate': tune.loguniform(1e-4, 1e-2),
    }
    scheduler = ASHAScheduler(
        max_t=args.max_epochs,
        grace_period=1,
        reduction_factor=2)
    reporter = CLIReporter(
        parameter_columns=["learning_rate"],
        metric_columns=["loss", "training_iteration"])
    resources_per_trial = {'cpu': 24, 'gpu': args.gpus}
    train_fn_with_parameters = tune.with_parameters(
        train_tune,
        args=args
    )
    analysis = tune.run(
        train_fn_with_parameters,
        resources_per_trial=resources_per_trial,
        metric="loss",
        mode="min",
        config=config,
        num_samples=args.num_samples,
        scheduler=scheduler,
        progress_reporter=reporter,
        name='tune_test',
    )
    print('Best hyperparameters found were: ', analysis.best_config)
So what am I doing wrong?