Best model based on Checkpoint not Last epoch

I am using Ray Tune with the PyTorch Lightning callback. I followed the instructions in the tutorial but encountered a problem: although checkpoints are being saved, the best model is chosen based on the last epoch. That is, analysis.best_checkpoint is the checkpoint of the model with the best score at the last epoch, so I guess the Ray algorithm only considers the last epoch of each model when finding the best one, which is not appropriate for my model. Am I doing anything wrong?

trainable = partial(train_for_tune, ...)

analysis = tune.run(
    trainable,
    resources_per_trial={
        "cpu": 1,
        "gpu": 1
    },
    ...)

Hmm, can you provide a bit more context about your bug? Stack traces/logs of the errors would be helpful.

It seems like you’ve set everything correctly, and that the checkpoints should be saved according to the top “val_ci”.

@rliaw there isn’t any error, and the checkpoints are being saved. My problem (or misunderstanding) is that while trying different hyperparameters I want Ray to evaluate models based on their checkpoints, not on the last epoch.
For example, here I ran a small experiment:

== Status ==
Memory usage on this node: 4.1/47.1 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/12 CPUs, 0/2 GPUs, 0.0/27.39 GiB heap, 0.0/9.42 GiB objects (0/1.0 accelerator_type:GTX)
Current best trial: ab6bc_00002 with val_ci=0.5972576141357422 and parameters={'h_size': 8, 'rec_lmb': 0.98, 'k': 30, 'lr': 0.001, 'l2': 5e-05, 'bins': 10}
Result logdir: ***
Number of trials: 4/4 (4 TERMINATED)
| Trial name          | status     | loc   |   bins |   h_size |   iter |   total time (s) |   val_loss |   val_ci |
| DEFAULT_ab6bc_00000 | TERMINATED |       |     10 |       16 |   1911 |          61.6466 |   0.993486 | 0.594613 |
| DEFAULT_ab6bc_00001 | TERMINATED |       |     20 |       16 |   1830 |          61.8445 |   0.982069 | 0.56376  |
| DEFAULT_ab6bc_00002 | TERMINATED |       |     10 |        8 |   3030 |          95.5939 |   0.999031 | 0.597258 |
| DEFAULT_ab6bc_00003 | TERMINATED |       |     10 |        8 |   2207 |          74.0899 |   1.0091   | 0.556271 |
2021-01-02 14:39:53,452 INFO -- Total run time: 160.61 seconds (158.75 seconds for the tuning loop).
{'h_size': 8, 'rec_lmb': 0.98, 'k': 30, 'lr': 0.001, 'l2': 5e-05, 'bins': 10}

The numbers in the table show the metrics at the last epoch (the iteration counts differ because of early stopping). As you can see on the last line, analysis.best_checkpoint points to 00002. However, when I look at the logs I see that 00000 at epoch 20 had the highest val_ci of all trials, so I expect analysis.best_checkpoint to point to 00000.
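To make the distinction concrete, here is a minimal plain-Python sketch (the per-epoch values are made up for illustration) of "best trial by last-epoch metric" versus "best trial by best-epoch metric":

```python
# Illustrative sketch (made-up histories): picking the "best" trial
# by its final-epoch val_ci vs. by its peak val_ci across epochs.
trials = {
    "00000": [0.50, 0.62, 0.594613],  # per-epoch val_ci history
    "00002": [0.48, 0.55, 0.597258],
}

best_by_last_epoch = max(trials, key=lambda t: trials[t][-1])
best_by_best_epoch = max(trials, key=lambda t: max(trials[t]))

print(best_by_last_epoch)  # 00002
print(best_by_best_epoch)  # 00000
```

Trial 00002 wins on the final epoch, but 00000 wins if each trial is judged by its best checkpoint, which is the behavior I expect.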
I can also see that two checkpoints are saved for each trial: one for the best epoch and one for the last epoch (although I set checkpoint_at_end to False). And if I set checkpoint_at_end to True, the following error occurs:
ValueError: 'checkpoint_at_end' cannot be used with a checkpointable function. You can specify and register checkpoints within your trainable function.
Here is my trainable function:

def train_for_tune(config, checkpoint_dir=None, data_dir=None, num_epochs=EPOCH_NUM, num_gpus=1):
    # DataModule
    dm = ...

    # Model
    if checkpoint_dir:
        # Get the trained model file
        ckpt_file = os.path.join(checkpoint_dir, "checkpoint")
        # Load the model (just for the params)
        model = Model.load_from_checkpoint(ckpt_file)
    else:
        model = Model(...)

    # ----------- Early Stopping -----------
    early_stop_callback = EarlyStopping(...)

    # Report metrics and save a checkpoint at the end of each validation epoch
    metrics = {"val_loss": "val_loss", "val_ci": "val_ci"}
    tune_reporter = TuneReportCheckpointCallback(metrics, on="validation_end", filename="checkpoint")
    trainer = Trainer(
        callbacks=[tune_reporter, early_stop_callback],
        logger=TensorBoardLogger(save_dir=tune.get_trial_dir(), name="", version="."),
        ...)
    trainer.fit(model, dm)

I noticed that the checkpoint_dir branch is never taken.

Thanks in advance

@rliaw do you have any ideas?

Hey @asalimih, thanks for following up. A couple pointers:

  1. You don’t need to set checkpoint_at_end.
  2. best_checkpoint returns the most recent checkpoint. Instead, you might want to try analysis.get_best_checkpoint(metric, mode="max"), which lets you sort by a particular metric (as opposed to just taking the last checkpoint).

@rliaw thanks for the reply.
get_best_checkpoint takes a trial as input and returns the path of the best checkpoint of that trial. The reason I chose Ray Tune over a simple grid search is that it lets me explore a much bigger hyperparameter space efficiently. However, I expect Ray Tune to evaluate models based on their checkpoint performance while exploring the space. Is that possible?

> I expect ray-tune to evaluate models based on the performance in checkpoint while exploring the space.

Hmm, I’m not sure what you mean by this. If you’re doing something like Bayesian optimization, Ray Tune will use the performance of your model to explore the space. You’re allowed to arbitrarily specify what metric you use to measure model performance (i.e., it can be a validation score, it can be a most recent checkpoint loaded into memory, etc).

Does that help?

So, as you said, Ray Tune will use the performance of my model to explore the space. But which performance: at the checkpoint or at the last epoch? What should I do to make it use the checkpoint?
Do you mean I should add another metric inside the LightningModule that holds the performance of the model at the last checkpoint?

@rliaw I have the same question. It seems like Ray Tune uses the last training epoch to measure the performance of a trial. Is there a way to tell it to use the best epoch of each trial instead?
I understand I can use analysis.get_best_checkpoint(metric, mode="max") to obtain that after tuning is done, but what about the logs and result tables printed during tuning: is it possible to do anything there?

Can you create your own metric that outputs the best seen value (i.e. max_accuracy) and provide that via your reporting callback?
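A minimal sketch of that idea, in plain Python (names are hypothetical; in a LightningModule you would update the running maximum at the end of each validation epoch and log it alongside the raw metric so Tune sees it):

```python
# Sketch: keep a running maximum of the validation metric and report
# it as its own metric, so the tuner ranks trials by best-so-far
# performance rather than by the current epoch's value.
class BestSoFar:
    def __init__(self):
        self.best = float("-inf")

    def update(self, value):
        self.best = max(self.best, value)
        return self.best

tracker = BestSoFar()
history = [0.55, 0.62, 0.58, 0.60]  # per-epoch val_ci
reported = [tracker.update(v) for v in history]
print(reported)  # [0.55, 0.62, 0.62, 0.62]
```

With a metric like this, even the trial tables printed during tuning reflect each trial's best epoch, since the reported value never decreases.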

@bpg back then I used this solution and it worked as expected.