Continue training for successful Ray Tune candidates

What I’ve done so far: I have run Ray Tune with 100 trials and a low number of epochs. Ray Tune has now concluded. Let’s say the best hyperparameter setting was A. The data on A is now in ~/ray-experiments/Trainable/A/.

What I want to do next (but don’t know how): Take A and continue training for another n epochs. The information from the additional epochs should be appended to the same directory (~/ray-experiments/Trainable/A/).

What I don’t want: Starting over with training, or giving it the “appearance” of a new trial (because that would confuse Weights & Biases or would require manual work).

Context: I’ve been running hyperparameter optimization over a model with many hyperparameters. I wanted to identify promising parts of the parameter space based on performance in the first epochs. Now I want to continue training in those promising parts of the parameter space.

I know how to simply restore the last checkpoint or train the model, but it’s the integration into the same directory/data structure as the existing Tune runs that I cannot figure out.

Hi @kemok, I briefly chatted with @justinvyu and @xwjiang2010. My understanding is that there’s no built-in support for this yet, but it can be worked around by loading the best config and starting a new Tune experiment.

This request has come up a few times in the past, but we aren’t sure whether it has reached the critical mass to prioritize. We’re in the process of designing the Tuner.restore behavior, and it would be great if you could give us context on your use case, key requirements, and feedback.

Here’s an example of a workaround you can use for now. Since the trial with the good hyperparameter configuration was only run for a few epochs, it might be acceptable to restart new trials without restoring weights, although I realize this is not the ideal behavior you want. Doing this would also let you run multiple seeds with the hyperparameter configuration of A. See the code below:

from ray import tune

# Original run of 100 trials for a few epochs;
# the original param_space might specify a large grid search
tuner = tune.Tuner(..., param_space={...})
result_grid = tuner.fit()
best_result = result_grid.get_best_result(
    metric=<your metric>,
    mode=<"max" or "min">
)

# New run with N samples with the best hyperparameter configuration
new_tuner = tune.Tuner(
    ...,
    param_space=best_result.config,
    tune_config=tune.TuneConfig(num_samples=N)
)
new_result_grid = new_tuner.fit()

Another option that might be useful to you is Population Based Training (PBT), a hyperparameter search algorithm built around exactly this idea of “continuing training for the promising parts”.

Thank you @Jiao_Dong @justinvyu ! I very much appreciate your quick help.

@justinvyu Yes, this is similar to what I am doing at the moment in order to train longer and still have it register as a trial (I’m using Ray Tune’s wandb callbacks to keep track of runs in the browser).

Starting fresh rather than continuing isn’t optimal, of course. Being able to simply call restore on a trial and continue would be great.

However, I have since noticed that, contrary to what I wrote when opening this thread, I don’t really care whether it shows up as a new trial or not. In fact, it might be better to make it a new trial after all.

Just to explain why I was initially interested in not creating a new trial:

My concern was about not confusing the search algorithm (Optuna) by having two trials with the same hyperparameters but different results. But this concern is actually unfounded: if I “manually” change the maximum number of epochs trained, that also makes the result non-comparable to the other results, so I need to keep these things apart anyway.

But yes, I guess that switching to PBT, or having everything determined by early stopping (without a maximum number of epochs to train for), is the way to go.

Thanks again for your comments! :heart:
