Lightning- Early Stopping of training in Tune

Mike1 · November 29, 2022, 7:21pm

I have read this guide. In this guide, for each hyperparameter combination, it seems like Tune uses the metrics obtained by the network weights at the end of its training. However, I would like to use the network weights which yield the lowest validation score throughout training. For example, if the grid contains two hyperparameter combinations, and trains each of the two networks for 500 iterations, but the first network obtains the lowest validation score at iteration 70 and the second network obtains it at iteration 215, I want the grid search to compare the networks at their best points (iterations 70 and 215 respectively) instead of at iteration 500. For a single network, I know how to do that: use a ModelCheckpoint, and then use the best_model_path property. However, I don’t know how to make Tune do that. Can anyone help? Thank you!

Yard1 · November 29, 2022, 7:56pm

Hey @Mike1, you can achieve that by configuring checkpointing in Tune to keep the best checkpoint per trial according to a metric. You can do that through the keep_checkpoints_num and checkpoint_score_attr arguments of the tune.run API, or through the CheckpointConfig object in the new, recommended Tuner API (available from Ray>=2.0; you can see how to use it in the latest version of the documentation: "Using PyTorch Lightning with Tune", Ray 2.1.0).

Using the example I linked, you’d specify the run_config argument as:

from ray.air.config import RunConfig, CheckpointConfig

        # passed as an argument to tune.Tuner(...) in the linked example
        run_config=RunConfig(
            name="tune_mnist_asha",
            progress_reporter=reporter,
            checkpoint_config=CheckpointConfig(
                # rank checkpoints by the reported "loss" metric, lower is better
                checkpoint_score_attribute="loss",
                checkpoint_score_order="min",
                # num_to_keep=1,  # optionally set to only keep the best checkpoint on disk/cloud
            ),
        ),

Then, when you access the checkpoints after the run through the checkpoint attribute (e.g. results.get_best_result().checkpoint), you will receive the checkpoint taken at the iteration that minimized the loss.
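To make the idea of score-based checkpoint retention concrete, here is a minimal pure-Python sketch. It is not Ray's implementation; the Checkpoint tuple, the keep_best helper, and the loss values are all hypothetical and only mimic what checkpoint_score_attribute="loss", checkpoint_score_order="min", and num_to_keep are described as doing above.

```python
from typing import NamedTuple, Optional


class Checkpoint(NamedTuple):
    # Hypothetical stand-in for a checkpoint saved by one trial.
    iteration: int
    loss: float


def keep_best(checkpoints, num_to_keep: Optional[int] = None, order: str = "min"):
    """Rank checkpoints by loss (best first) and keep at most num_to_keep."""
    ranked = sorted(checkpoints, key=lambda c: c.loss, reverse=(order == "max"))
    return ranked if num_to_keep is None else ranked[:num_to_keep]


# Made-up losses reported by a single trial; the best one is mid-training.
history = [Checkpoint(1, 0.90), Checkpoint(2, 0.40), Checkpoint(3, 0.55)]

best = keep_best(history, num_to_keep=1)[0]
print(best.iteration, best.loss)
```

With these made-up values, the retained checkpoint comes from iteration 2 rather than from the final iteration.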

Mike1 · December 7, 2022, 7:14pm

Thank you for the answer! However, I am not sure it does what I meant. It looks like results.get_best_result() still returns the network that got the best val loss at the end of training, not at the point where val loss was smallest, and the checkpoint returns the best val loss point for that network. For example: suppose I have two networks, net1 and net2, and:

loss(net1_at_end_of_training) < loss(net2_at_end_of_training)
loss(net1_at_best_point_during_training) > loss(net2_at_best_point_during_training)

it seems that your code returns net1_at_best_point_during_training, but I want something that returns net2_at_best_point_during_training. Any suggestions?

Yard1 · December 7, 2022, 7:27pm

Got it, thanks for clarifying! In that case, you want to do:
results.get_best_result(scope="all").checkpoint. By default, get_best_result only considers the last reported metric of each trial, but you can change the scope so that it considers all reports. Then, checkpoint will return the best checkpoint associated with the result.
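The difference between the two scopes can be illustrated with a small pure-Python sketch. This is not Tune code: the histories dictionary, the best_trial helper, and the loss values are hypothetical, chosen so that they match the net1/net2 inequalities from the question.

```python
# Hypothetical validation losses per iteration for two trials.
histories = {
    "net1": [0.80, 0.60, 0.50, 0.45],  # keeps improving, best at the end
    "net2": [0.70, 0.30, 0.55, 0.60],  # best mid-training, worse afterwards
}


def best_trial(histories, scope="last"):
    """Pick the trial with the lowest loss, judged by the last or by all reports."""

    def score(losses):
        return losses[-1] if scope == "last" else min(losses)

    name = min(histories, key=lambda n: score(histories[n]))
    losses = histories[name]
    best_iteration = losses.index(min(losses)) + 1  # 1-based iteration number
    return name, best_iteration


print(best_trial(histories, scope="last"))
print(best_trial(histories, scope="all"))
```

Judged by the last report only, net1 wins (0.45 < 0.60); judged over all reports, net2 wins at iteration 2 (0.30 < 0.45). In Tune itself, the metric and mode would presumably also need to be supplied, either in the tuning configuration or in the call, e.g. results.get_best_result(metric="loss", mode="min", scope="all").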
