Lightning + Ray + TensorBoard = only one log

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I am running Ray Tune for hyperparameter tuning on Windows, combining it with a PyTorch Lightning module and enabling TensorBoard logging. I have mostly just followed the tutorial described here.

I expected that during tuning the logger would log the validation loss for every hyperparameter configuration, and since there are 3 configurations, there should be 3 curves in TensorBoard. However, I only see one when the tuning finishes.

I’d really like to have visibility into my models’ performance per hyperparameter configuration, so I hope this problem can be solved.

The following is the tuning code:

    # Imports (assuming the Ray AIR LightningTrainer API and pytorch_lightning)
    import os

    from pytorch_lightning.loggers import TensorBoardLogger

    from ray import air, tune
    from ray.air.config import CheckpointConfig, RunConfig, ScalingConfig
    from ray.train.lightning import LightningConfigBuilder, LightningTrainer
    from ray.tune.schedulers import ASHAScheduler

    num_epochs = 3
    num_samples = 10
    accelerator = 'gpu'
    checkpoint = CHECKPOINT

    config = {
        'checkpoint': checkpoint,
        'torch-seed': 0,
        'learning-rate': tune.loguniform(1e-7, 1e-5),  # tune.loguniform(lower, upper) expects lower < upper
        # some other hyperparameters
    }

    dm_config = {
        "checkpoint": checkpoint,
        "batch-size": 20,
        "np-seed": 0,
        "validation-split": 0.2,
        "test-split": 0.0,
        "data-id": "main",
    }
    dm = DATAMODULE(dm_config)
    logger = TensorBoardLogger(save_dir=os.getcwd(), name=TB_LOGDIR)
    
    lightning_config = (
        LightningConfigBuilder()
        .module(cls=MODEL, config=config)
        .trainer(
            max_epochs=num_epochs, 
            accelerator=accelerator, 
            logger=logger, 
            enable_progress_bar=False,
            log_every_n_steps=1,
        )
        .fit_params(datamodule=dm)
        .checkpointing(monitor="Loss/avg_valid", save_top_k=2, mode="min")
        .build()
    )

    # Define an AIR CheckpointConfig to properly save checkpoints in AIR format.
    run_config = RunConfig(
        checkpoint_config=CheckpointConfig(
            num_to_keep=2,
            checkpoint_score_attribute="Loss/avg_valid",
            checkpoint_score_order="min",
        )
    )

    # The ASHA scheduler decides at each iteration which trials are performing badly and stops them early.
    scheduler = ASHAScheduler(max_t=num_epochs, grace_period=1, reduction_factor=2)

    scaling_config = ScalingConfig(
        num_workers=1,
        use_gpu=True,
        resources_per_worker={"CPU": 1, "GPU": 1},
    )
    lightning_trainer = LightningTrainer(
        scaling_config=scaling_config, run_config=run_config,
    )
    
    tuner = tune.Tuner(
        lightning_trainer,
        param_space={"lightning_config": lightning_config},
        tune_config=tune.TuneConfig(
            metric="Loss/avg_valid",
            mode="min",
            # num_samples=10,
            scheduler=scheduler,
        ),
        run_config=air.RunConfig(
            name="tune_clip",
        ),
    )

    results = tuner.fit()
    best_result = results.get_best_result(metric="Loss/avg_valid", mode="min")

The MODEL class inherits from pytorch_lightning.LightningModule, and its on_validation_epoch_end(self) is overridden like so:

    def on_validation_epoch_end(self):
        # Average the per-batch validation losses collected during this epoch.
        avg_valid_loss = torch.stack(self.valid_loss_list).mean()
        # self.log("Loss/avg_valid", avg_valid_loss, prog_bar=False, on_epoch=True, sync_dist=True)
        # Log the epoch-averaged loss; logging 'step' makes TensorBoard use the
        # epoch number as the x-axis instead of the global step.
        self.log_dict({'Loss/avg_valid': avg_valid_loss, 'step': self.current_epoch + 1.0})
        self.avg_valid_loss = avg_valid_loss
        self.valid_loss_list.clear()
I’m not very familiar with Ray, but I suspect that because the logger is defined once at the configuration level and passed into the trainer, it is shared by every hyperparameter configuration and therefore overwrites data from previous training sessions. Even if that is true, however, it doesn’t tell me how to solve the problem. Could anyone help take a look? Thanks!

Hi @kper22020, yeah, the logger might write to the same directory across multiple runs.

Actually, Ray Tune already logs the reported metrics with ray.tune.logger.TBXLoggerCallback. You can find the log artifact in each trial folder, e.g. {experiment_name}/LightningTrainer_abcde_*/events.out.tfevents.1687466006.g-e7ca70940d85f0001
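
For example, you could also attach the callback explicitly and then point TensorBoard at the experiment directory (just a sketch; Tune normally adds this callback by default when tensorboardX is installed, and ~/ray_results is the default results location):

    from ray import air
    from ray.tune.logger import TBXLoggerCallback

    # Adding the TensorBoardX callback explicitly; Tune attaches it by default,
    # so this is mostly for clarity.
    run_config = air.RunConfig(
        name="tune_clip",
        callbacks=[TBXLoggerCallback()],
    )

    # Each trial gets its own events file, so TensorBoard shows one curve per
    # hyperparameter configuration:
    #   tensorboard --logdir ~/ray_results/tune_clip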

If you still want to use Lightning’s TensorBoardLogger, a workaround is to set save_dir to a relative path. Then you can find the logs in each trial’s artifact folder, like {experiment_name}/LightningTrainer_abcde_*/rank_0/{save_dir}
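
A minimal sketch of that workaround (the "tb_logs" directory name is just an example):

    from pytorch_lightning.loggers import TensorBoardLogger

    # A relative save_dir resolves inside each trial's own working directory,
    # so every hyperparameter configuration writes its own event files instead
    # of all trials sharing one absolute path on the driver.
    logger = TensorBoardLogger(save_dir="tb_logs", name=TB_LOGDIR)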

@yunxuanx Thanks, I wasn’t aware of this, I will take a look at the callback and see if I can get it to work.
