Lightning + Ray + TensorBoard = only one log

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I am running Ray Tune for hyperparameter tuning on Windows, combining it with a PyTorch Lightning module and enabling TensorBoard logging. I have mostly just followed the tutorial described here.

I expected that during tuning the logger would log the validation loss for every hyperparameter configuration, and since there are 3 configurations, there should be 3 curves in TensorBoard. However, I only see one when the tuning finishes.

I’d really like to have visibility into my models’ performance per hyperparameter configuration, so I hope this problem can be solved.

The following is the tuning code:

    # Imports (assuming the Ray AIR LightningTrainer API and pytorch_lightning)
    import os

    from pytorch_lightning.loggers import TensorBoardLogger

    from ray import air, tune
    from ray.air.config import CheckpointConfig, RunConfig, ScalingConfig
    from ray.train.lightning import LightningConfigBuilder, LightningTrainer
    from ray.tune.schedulers import ASHAScheduler

    num_epochs = 3
    num_samples = 10
    accelerator = 'gpu'
    checkpoint = CHECKPOINT

    config = {
        'checkpoint': checkpoint,
        'torch-seed': 0,
        'learning-rate': tune.loguniform(1e-7, 1e-5),  # tune.loguniform(lower, upper) expects lower < upper
        # some other hyperparameters
    }

    dm_config = {
        "checkpoint": checkpoint,
        "batch-size": 20,
        "np-seed": 0,
        "validation-split": 0.2,
        "test-split": 0.0,
        "data-id": "main",
    }
    dm = DATAMODULE(dm_config)
    logger = TensorBoardLogger(save_dir=os.getcwd(), name=TB_LOGDIR)
    
    lightning_config = (
        LightningConfigBuilder()
        .module(cls=MODEL, config=config)
        .trainer(
            max_epochs=num_epochs, 
            accelerator=accelerator, 
            logger=logger, 
            enable_progress_bar=False,
            log_every_n_steps=1,
        )
        .fit_params(datamodule=dm)
        .checkpointing(monitor="Loss/avg_valid", save_top_k=2, mode="min")
        .build()
    )

    # Define an AIR CheckpointConfig to properly save checkpoints in AIR format.
    run_config = RunConfig(
        checkpoint_config=CheckpointConfig(
            num_to_keep=2,
            checkpoint_score_attribute="Loss/avg_valid",
            checkpoint_score_order="min",
        )
    )

    # The ASHA scheduler decides at each iteration which trials are performing badly and stops them early.
    scheduler = ASHAScheduler(max_t=num_epochs, grace_period=1, reduction_factor=2)

    scaling_config = ScalingConfig(
        num_workers=1,
        use_gpu=True,
        resources_per_worker={"CPU": 1, "GPU": 1},
    )
    lightning_trainer = LightningTrainer(
        scaling_config=scaling_config, run_config=run_config,
    )
    
    tuner = tune.Tuner(
        lightning_trainer,
        param_space={"lightning_config": lightning_config},
        tune_config=tune.TuneConfig(
            metric="Loss/avg_valid",
            mode="min",
            # num_samples=10,
            scheduler=scheduler,
        ),
        run_config=air.RunConfig(
            name="tune_clip",
        ),
    )

    results = tuner.fit()
    best_result = results.get_best_result(metric="Loss/avg_valid", mode="min")

The MODEL class inherits from pytorch_lightning.LightningModule, and its on_validation_epoch_end(self) is overridden like so:

    def on_validation_epoch_end(self):
        # Average the per-batch validation losses collected during this epoch.
        avg_valid_loss = torch.stack(self.valid_loss_list).mean()
        # self.log("Loss/avg_valid", avg_valid_loss, prog_bar=False, on_epoch=True, sync_dist=True)
        # Log the epoch-averaged loss; logging 'step' makes TensorBoard use the
        # epoch number as the x-axis instead of the global step.
        self.log_dict({'Loss/avg_valid': avg_valid_loss, 'step': self.current_epoch + 1.0})
        self.avg_valid_loss = avg_valid_loss
        self.valid_loss_list.clear()
I’m not very familiar with Ray, but I suspect that because the logger is defined once at the configuration level and passed into the trainer, it is shared by every hyperparameter configuration and therefore overwrites data from previous training sessions. Even if that is true, however, it doesn’t tell me how to solve the problem. Could anyone help take a look? Thanks!

Hi @kper22020, yeah, the logger might write to the same directory across multiple runs.

Actually, Ray Tune already logs the reported metrics with ray.tune.logger.TBXLoggerCallback. You can find the log artifact in each trial folder, e.g. {experiment_name}/LightningTrainer_abcde_*/events.out.tfevents.1687466006.g-e7ca70940d85f0001
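
For example, you could also attach the callback explicitly and then point TensorBoard at the experiment directory (just a sketch; Tune normally adds this callback by default when tensorboardX is installed, and ~/ray_results is the default results location):

    from ray import air
    from ray.tune.logger import TBXLoggerCallback

    # Adding the TensorBoardX callback explicitly; Tune attaches it by default,
    # so this is mostly for clarity.
    run_config = air.RunConfig(
        name="tune_clip",
        callbacks=[TBXLoggerCallback()],
    )

    # Each trial gets its own events file, so TensorBoard shows one curve per
    # hyperparameter configuration:
    #   tensorboard --logdir ~/ray_results/tune_clip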

If you still want to use Lightning’s TensorBoardLogger, a workaround is to set save_dir to a relative path. Then you can find the logs in each trial’s artifact folder, like {experiment_name}/LightningTrainer_abcde_*/rank_0/{save_dir}
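
A minimal sketch of that workaround (the "tb_logs" directory name is just an example):

    from pytorch_lightning.loggers import TensorBoardLogger

    # A relative save_dir resolves inside each trial's own working directory,
    # so every hyperparameter configuration writes its own event files instead
    # of all trials sharing one absolute path on the driver.
    logger = TensorBoardLogger(save_dir="tb_logs", name=TB_LOGDIR)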

@yunxuanx Thanks, I wasn’t aware of this, I will take a look at the callback and see if I can get it to work.
