Best model based on Checkpoint not Last epoch

I am using Ray Tune with the PyTorch Lightning callback. I followed the instructions in the tutorial but encountered a problem: although checkpoints are being saved, the best model is chosen based on the last epoch. That is, analysis.best_checkpoint is the checkpoint of the model with the best score at the last epoch, so I guess the Ray algorithm only considers the last epoch of each model when finding the best one, which is not appropriate for my model. Am I doing anything wrong?

trainable = partial(train_for_tune, ...)

analysis = tune.run(
    trainable,
    resources_per_trial={
        "cpu": 1,
        "gpu": 1
    },
    ...)

Hmm, can you provide a bit more context about your bug? Stack traces/logs of the errors would be helpful.

It seems like you’ve set everything correctly, and that the checkpoints should be saved according to the top “val_ci”.

@rliaw there isn’t any error, and the checkpoints are being saved. My problem (or misunderstanding) is that while trying different hyperparameters I want Ray to evaluate models based on their checkpoints, not on the last epoch.
For example, here I ran a small experiment:

== Status ==
Memory usage on this node: 4.1/47.1 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/12 CPUs, 0/2 GPUs, 0.0/27.39 GiB heap, 0.0/9.42 GiB objects (0/1.0 accelerator_type:GTX)
Current best trial: ab6bc_00002 with val_ci=0.5972576141357422 and parameters={'h_size': 8, 'rec_lmb': 0.98, 'k': 30, 'lr': 0.001, 'l2': 5e-05, 'bins': 10}
Result logdir: ***
Number of trials: 4/4 (4 TERMINATED)
| Trial name          | status     | loc   |   bins |   h_size |   iter |   total time (s) |   val_loss |   val_ci |
| DEFAULT_ab6bc_00000 | TERMINATED |       |     10 |       16 |   1911 |          61.6466 |   0.993486 | 0.594613 |
| DEFAULT_ab6bc_00001 | TERMINATED |       |     20 |       16 |   1830 |          61.8445 |   0.982069 | 0.56376  |
| DEFAULT_ab6bc_00002 | TERMINATED |       |     10 |        8 |   3030 |          95.5939 |   0.999031 | 0.597258 |
| DEFAULT_ab6bc_00003 | TERMINATED |       |     10 |        8 |   2207 |          74.0899 |   1.0091   | 0.556271 |
2021-01-02 14:39:53,452 INFO -- Total run time: 160.61 seconds (158.75 seconds for the tuning loop).
{'h_size': 8, 'rec_lmb': 0.98, 'k': 30, 'lr': 0.001, 'l2': 5e-05, 'bins': 10}

The numbers in the table show the metrics at the last epoch (the iteration counts differ because of early stopping). As you can see on the last line, analysis.best_checkpoint points to 00002. However, when I look at the logs I see that 00000 at epoch 20 had the highest val_ci of all trials, so I expect analysis.best_checkpoint to point to 00000.
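To make the distinction concrete, here is a minimal plain-Python sketch (the per-epoch values are made up for illustration) of "best trial by last-epoch metric" versus "best trial by best-epoch metric":

```python
# Illustrative sketch (made-up histories): picking the "best" trial
# by its final-epoch val_ci vs. by its peak val_ci across epochs.
trials = {
    "00000": [0.50, 0.62, 0.594613],  # per-epoch val_ci history
    "00002": [0.48, 0.55, 0.597258],
}

best_by_last_epoch = max(trials, key=lambda t: trials[t][-1])
best_by_best_epoch = max(trials, key=lambda t: max(trials[t]))

print(best_by_last_epoch)  # 00002
print(best_by_best_epoch)  # 00000
```

Trial 00002 wins on the final epoch, but 00000 wins if each trial is judged by its best checkpoint, which is the behavior I expect.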
I can also see that two checkpoints are saved for each trial: one for the best epoch and one for the last epoch (although I set checkpoint_at_end to False). And if I set checkpoint_at_end to True, the following error occurs:
ValueError: 'checkpoint_at_end' cannot be used with a checkpointable function. You can specify and register checkpoints within your trainable function.
Here is my trainable function:

def train_for_tune(config, checkpoint_dir=None, data_dir=None, num_epochs=EPOCH_NUM, num_gpus=1):
    # DataModule
    dm = ...

    # Model
    if checkpoint_dir:
        # Get the trained model file
        ckpt_file = os.path.join(checkpoint_dir, "checkpoint")
        # Load the model (just for the params)
        model = Model.load_from_checkpoint(ckpt_file)
    else:
        model = Model(...)

    # ----------- Early Stopping -----------
    early_stop_callback = EarlyStopping(...)

    # Report metrics and save a checkpoint at the end of each validation epoch
    metrics = {"val_loss": "val_loss", "val_ci": "val_ci"}
    tune_reporter = TuneReportCheckpointCallback(metrics, on="validation_end", filename="checkpoint")
    trainer = Trainer(
        callbacks=[tune_reporter, early_stop_callback],
        logger=TensorBoardLogger(save_dir=tune.get_trial_dir(), name="", version="."),
        ...)
    trainer.fit(model, dm)

I noticed that the checkpoint_dir branch is never taken.

Thanks in advance

@rliaw do you have any ideas?

Hey @asalimih, thanks for following up. A couple pointers:

  1. You don’t need to set checkpoint_at_end.
  2. best_checkpoint returns the most recent checkpoint. Instead, you might want to try analysis.get_best_checkpoint(metric, mode="max"), which lets you sort by a particular metric (as opposed to just taking the last checkpoint).

@rliaw thanks for the reply.
get_best_checkpoint takes a trial as input and returns the path of the best checkpoint of that trial. The reason I chose Ray Tune over a simple grid search is that it lets me explore a much bigger hyperparameter space efficiently. However, I expect Ray Tune to evaluate models based on their checkpoint performance while exploring the space. Is that possible?

> I expect ray-tune to evaluate models based on the performance in checkpoint while exploring the space.

Hmm, I’m not sure what you mean by this. If you’re doing something like Bayesian optimization, Ray Tune will use the performance of your model to explore the space. You’re allowed to arbitrarily specify what metric you use to measure model performance (i.e., it can be a validation score, it can be a most recent checkpoint loaded into memory, etc).

Does that help?

So, as you said, Ray Tune will use the performance of my model to explore the space. But which performance: at the checkpoint or at the last epoch? What should I do to make it use the checkpoint?
Do you mean I should add another metric inside the LightningModule that holds the performance of the model at the last checkpoint?

@rliaw I have the same question. It seems like Ray Tune uses the last training epoch to measure the performance of a trial. Is there a way to tell it to use the best epoch of each trial instead?
I understand I can use analysis.get_best_checkpoint(metric, mode="max") to obtain that after tuning is done, but what about the logs and result tables printed during tuning: is it possible to do anything there?

Can you create your own metric that outputs the best seen value (i.e. max_accuracy) and provide that via your reporting callback?
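A minimal sketch of that idea, in plain Python (names are hypothetical; in a LightningModule you would update the running maximum at the end of each validation epoch and log it alongside the raw metric so Tune sees it):

```python
# Sketch: keep a running maximum of the validation metric and report
# it as its own metric, so the tuner ranks trials by best-so-far
# performance rather than by the current epoch's value.
class BestSoFar:
    def __init__(self):
        self.best = float("-inf")

    def update(self, value):
        self.best = max(self.best, value)
        return self.best

tracker = BestSoFar()
history = [0.55, 0.62, 0.58, 0.60]  # per-epoch val_ci
reported = [tracker.update(v) for v in history]
print(reported)  # [0.55, 0.62, 0.62, 0.62]
```

With a metric like this, even the trial tables printed during tuning reflect each trial's best epoch, since the reported value never decreases.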

@bpg back then I used this solution and it worked as expected.