Add additional trials to experiment

This is my first project using Ray 1.6.0 and ray.tune

Description:
I am performing a network architecture search with Ray Tune.
The call to tune.run is given the parameters listed below.

import os

from ray import tune
from ray.tune.suggest import ConcurrencyLimiter
from ray.tune.suggest.hyperopt import HyperOptSearch

search_alg = HyperOptSearch()
if params.restore:
    search_alg.restore_from_dir(os.path.join(params.local_dir, params.exp_name))

search_alg = ConcurrencyLimiter(search_alg, max_concurrent=10)

analysis = tune.run(
    train,
    name=params.exp_name,  # f"{UID}"
    config=config,
    search_alg=search_alg,
    num_samples=200,  # changed to 400 if needed
    metric="Fitness",
    mode="max",
    local_dir=params.local_dir,
    log_to_file=True,
    resources_per_trial={'cpu': 4, 'gpu': 1},
    resume=True,
    stop=tune.stopper.MaximumIterationStopper(1))

train is a function that accepts a config sample from HyperOptSearch as its parameter. It instantiates and trains a neural network (TensorFlow 2) and returns upon completion of training; it does not relinquish control at any point during training. The function returns the following:

return {'Loss': values['loss'], 'Fitness': fitness,  'EarlyStop': callbacks[1].stopped_epoch}
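
For context, here is a minimal sketch of what such a function-style trainable could look like. The hyperparameter names, the model, the dummy data, and the fitness computation are placeholders for illustration only, not the actual search space:

import numpy as np
import tensorflow as tf

def train(config):
    """Minimal sketch of a function-style trainable; hyperparameters are hypothetical."""
    # Dummy data stands in for the real dataset.
    x_train = np.random.rand(256, 8).astype("float32")
    y_train = np.random.rand(256, 1).astype("float32")

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(config["units"], activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(config["lr"]), loss="mse")

    early_stop = tf.keras.callbacks.EarlyStopping(monitor="loss", patience=5)
    history = model.fit(x_train, y_train, epochs=50, callbacks=[early_stop], verbose=0)

    loss = history.history["loss"][-1]
    fitness = -loss  # placeholder: higher is better, so tune.run can maximize it
    # Returning a dict reports the final metrics once; with MaximumIterationStopper(1)
    # the trial counts as a single iteration.
    return {"Loss": loss, "Fitness": fitness, "EarlyStop": early_stop.stopped_epoch}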

Problem:
I am running Ray on a cluster of machines (24-hour max time per session) by starting the Ray cluster manually with a script and then performing a tuning experiment. Given my current resources, I can run 200 trials successfully within the 24-hour limit before the session is terminated on the cluster. I want to run a much larger experiment, e.g., 1000 trials, but I have to break it up into five 200-trial sessions and continue from previously recorded results.
The trial results are stored in ~/results/nas/ and the experiment name is exp-ur-extended. Currently I have 200 subdirectories, each corresponding to a trial and its results.
Is it possible to:

  1. Load the state of HyperOptSearch from the files located in ~/results/nas/exp-ur-extended
  2. Generate trial 201 through 400 and record the new trials in ~/results/nas/exp-ur-extended

Currently the code runs, but it terminates immediately after the Ray cluster is instantiated, as if it has nothing to do. Any suggestions on how to proceed?

Hi,

The way you would usually go about this is to specify num_samples=1000 and resume=True (or resume="AUTO"). When you re-run the same script, Tune will pick up from the latest state and continue running the experiment. A minimal sketch of that change is shown below.
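
For example, keeping the rest of your original tune.run arguments unchanged, only num_samples and resume differ (this is just an illustration of the two changed parameters, not a tested snippet):

analysis = tune.run(
    train,
    name=params.exp_name,
    config=config,
    search_alg=search_alg,
    num_samples=1000,       # raised from 200 so Tune has new trials to generate
    metric="Fitness",
    mode="max",
    local_dir=params.local_dir,
    log_to_file=True,
    resources_per_trial={'cpu': 4, 'gpu': 1},
    resume="AUTO",          # pick up the existing experiment state if one is found
    stop=tune.stopper.MaximumIterationStopper(1))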

There is currently no way to continue training and just add new trials. However, we’ll be refactoring part of the user interface in early 2022, so we may want to add this functionality eventually.

Thanks. I will try this method.

Unfortunately, changing num_samples to 1000 with resume=True doesn't seem to produce any additional trials, and the task still terminates immediately. There don't seem to be any crashes.

Are there any facilities within Ray or Ray Tune that provide logging so that I may track the progress of the scheduler and search algorithm?

You can instead try doing the following:

from ray import tune
from ray.tune.suggest.hyperopt import HyperOptSearch

search_alg = HyperOptSearch()

experiment_1 = tune.run(
    trainable,
    search_alg=search_alg)

# Save the search algorithm state after the first experiment completes
search_alg.save("./my-checkpoint.pkl")

# Restore the saved state onto another search algorithm

search_alg2 = HyperOptSearch()
search_alg2.restore("./my-checkpoint.pkl")

experiment_2 = tune.run(
    trainable,
    search_alg=search_alg2)

See the docs here!

https://docs.ray.io/en/latest/tune/api_docs/suggestion.html#saving-and-restoring
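
Adapted to your paths, the pattern could look roughly like this. The checkpoint filename is just an example, params is your own config object, and I haven't verified this end to end, so treat it as a sketch:

import os

from ray import tune
from ray.tune.suggest import ConcurrencyLimiter
from ray.tune.suggest.hyperopt import HyperOptSearch

# Example checkpoint path inside the existing results directory.
searcher_ckpt = os.path.join(params.local_dir, "hyperopt-searcher.pkl")

search_alg = HyperOptSearch()
if params.restore and os.path.exists(searcher_ckpt):
    # Restore the searcher state saved at the end of the previous 200-trial session.
    search_alg.restore(searcher_ckpt)

analysis = tune.run(
    train,
    name=params.exp_name,   # same name so new trials land in ~/results/nas/exp-ur-extended
    config=config,
    search_alg=ConcurrencyLimiter(search_alg, max_concurrent=10),
    num_samples=200,        # 200 new trials per 24-hour session
    metric="Fitness",
    mode="max",
    local_dir=params.local_dir,
    log_to_file=True,
    resources_per_trial={'cpu': 4, 'gpu': 1},
    stop=tune.stopper.MaximumIterationStopper(1))

# Save the searcher state so the next session can continue from it.
search_alg.save(searcher_ckpt)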