Add additional trials to experiment

This is my first project using Ray 1.6.0 and ray.tune

Description:
I am performing a network architecture search with Ray Tune.
The call to tune.run is given the parameters listed below.

import os

from ray import tune
from ray.tune.suggest import ConcurrencyLimiter
from ray.tune.suggest.hyperopt import HyperOptSearch

search_alg = HyperOptSearch()
if params.restore:
    search_alg.restore_from_dir(os.path.join(params.local_dir, params.exp_name))

search_alg = ConcurrencyLimiter(search_alg, max_concurrent=10)

analysis = tune.run(
    train,
    name=params.exp_name,  # f"{UID}"
    config=config,
    search_alg=search_alg,
    num_samples=200,  # changed to 400 if needed
    metric="Fitness",
    mode="max",
    local_dir=params.local_dir,
    log_to_file=True,
    resources_per_trial={'cpu': 4, 'gpu': 1},
    resume=True,
    stop=tune.stopper.MaximumIterationStopper(1))

train is a function that accepts a config sample from HyperOptSearch as its parameter. It instantiates and trains a neural network (TensorFlow 2) and returns upon completion of training; it does not relinquish control at any point during training. The function returns the following:

return {'Loss': values['loss'], 'Fitness': fitness,  'EarlyStop': callbacks[1].stopped_epoch}
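
For context, here is a minimal sketch of what such a function-style trainable could look like. The hyperparameter names, the model, the dummy data, and the fitness computation are placeholders for illustration only, not the actual search space:

import numpy as np
import tensorflow as tf

def train(config):
    """Minimal sketch of a function-style trainable; hyperparameters are hypothetical."""
    # Dummy data stands in for the real dataset.
    x_train = np.random.rand(256, 8).astype("float32")
    y_train = np.random.rand(256, 1).astype("float32")

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(config["units"], activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(config["lr"]), loss="mse")

    early_stop = tf.keras.callbacks.EarlyStopping(monitor="loss", patience=5)
    history = model.fit(x_train, y_train, epochs=50, callbacks=[early_stop], verbose=0)

    loss = history.history["loss"][-1]
    fitness = -loss  # placeholder: higher is better, so tune.run can maximize it
    # Returning a dict reports the final metrics once; with MaximumIterationStopper(1)
    # the trial counts as a single iteration.
    return {"Loss": loss, "Fitness": fitness, "EarlyStop": early_stop.stopped_epoch}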

Problem:
I am running Ray on a cluster of machines (24-hour max time per session) by starting the Ray cluster manually with a script and then performing a tuning experiment. Given my current resources, I can run 200 trials successfully within the 24-hour limit before the session is terminated on the cluster. I want to run a much larger experiment, e.g., 1000 trials, but I have to break it up into five 200-trial sessions and continue from previously recorded results.
The trial results are stored in ~/results/nas/ and the experiment name is exp-ur-extended. Currently I have 200 subdirectories, each corresponding to a trial and its results.
Is it possible to:

  1. Load the state of HyperOptSearch from the files located in ~/results/nas/exp-ur-extended
  2. Generate trial 201 through 400 and record the new trials in ~/results/nas/exp-ur-extended

Currently the code runs, but it terminates immediately after the Ray cluster is instantiated, as if it has nothing to do. Any suggestions on how to proceed?

Hi,

The way you would usually go about this is to specify num_samples=1000 and resume=True (or resume="AUTO"). When you re-run the same script, Tune will pick up from the latest state and continue running the experiment. A minimal sketch of that change is shown below.
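
For example, keeping the rest of your original tune.run arguments unchanged, only num_samples and resume differ (this is just an illustration of the two changed parameters, not a tested snippet):

analysis = tune.run(
    train,
    name=params.exp_name,
    config=config,
    search_alg=search_alg,
    num_samples=1000,       # raised from 200 so Tune has new trials to generate
    metric="Fitness",
    mode="max",
    local_dir=params.local_dir,
    log_to_file=True,
    resources_per_trial={'cpu': 4, 'gpu': 1},
    resume="AUTO",          # pick up the existing experiment state if one is found
    stop=tune.stopper.MaximumIterationStopper(1))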

There is currently no way to continue training and just add new trials. However, we’ll be refactoring part of the user interface in early 2022, so we may want to add this functionality eventually.

Thanks. I will try this method.

Unfortunately, changing num_samples to 1000 with resume=True doesn't seem to produce any additional trials, and the task still terminates immediately. There don't seem to be any crashes.

Are there any facilities within Ray or Ray Tune that provide logging so that I may track the progress of the scheduler and search algorithm?

You can instead try doing the following:

from ray import tune
from ray.tune.suggest.hyperopt import HyperOptSearch

search_alg = HyperOptSearch()

experiment_1 = tune.run(
    trainable,
    search_alg=search_alg)

# Save the search algorithm state after the first experiment completes
search_alg.save("./my-checkpoint.pkl")

# Restore the saved state onto another search algorithm

search_alg2 = HyperOptSearch()
search_alg2.restore("./my-checkpoint.pkl")

experiment_2 = tune.run(
    trainable,
    search_alg=search_alg2)

See the docs here!

https://docs.ray.io/en/latest/tune/api_docs/suggestion.html#saving-and-restoring
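
Adapted to your paths, the pattern could look roughly like this. The checkpoint filename is just an example, params is your own config object, and I haven't verified this end to end, so treat it as a sketch:

import os

from ray import tune
from ray.tune.suggest import ConcurrencyLimiter
from ray.tune.suggest.hyperopt import HyperOptSearch

# Example checkpoint path inside the existing results directory.
searcher_ckpt = os.path.join(params.local_dir, "hyperopt-searcher.pkl")

search_alg = HyperOptSearch()
if params.restore and os.path.exists(searcher_ckpt):
    # Restore the searcher state saved at the end of the previous 200-trial session.
    search_alg.restore(searcher_ckpt)

analysis = tune.run(
    train,
    name=params.exp_name,   # same name so new trials land in ~/results/nas/exp-ur-extended
    config=config,
    search_alg=ConcurrencyLimiter(search_alg, max_concurrent=10),
    num_samples=200,        # 200 new trials per 24-hour session
    metric="Fitness",
    mode="max",
    local_dir=params.local_dir,
    log_to_file=True,
    resources_per_trial={'cpu': 4, 'gpu': 1},
    stop=tune.stopper.MaximumIterationStopper(1))

# Save the searcher state so the next session can continue from it.
search_alg.save(searcher_ckpt)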