This is my first project using Ray 1.6.0 and ray.tune.
Description:
I am performing a network architecture search with ray tune.
The search algorithm and the call to tune.run are set up as follows:
import os

from ray import tune
from ray.tune.suggest import ConcurrencyLimiter
from ray.tune.suggest.hyperopt import HyperOptSearch

search_alg = HyperOptSearch()
if params.restore:
    # Reload the searcher state from the experiment directory
    search_alg.restore_from_dir(os.path.join(params.local_dir, params.exp_name))
# Cap the number of trials that run at the same time
search_alg = ConcurrencyLimiter(search_alg, max_concurrent=10)
analysis = tune.run(
    train,
    name=params.exp_name,  # f"{UID}"
    config=config,
    search_alg=search_alg,
    num_samples=200,  # changed to 400 if needed
    metric="Fitness",
    mode="max",
    local_dir=params.local_dir,
    log_to_file=True,
    resources_per_trial={'cpu': 4, 'gpu': 1},
    resume=True,
    stop=tune.stopper.MaximumIterationStopper(1),
)
train is a function that accepts a config sample from HyperOptSearch as its parameter. It instantiates and trains a neural network (TensorFlow 2) and returns only once training has completed. It does not relinquish control at any point during training. The return from the function is as follows:
return {'Loss': values['loss'], 'Fitness': fitness, 'EarlyStop': callbacks[1].stopped_epoch}
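To make the shape of train concrete, here is a minimal sketch. It is not the actual training code: fit_model is a hypothetical stand-in for building and fitting the TensorFlow model, the numbers it returns are placeholders, and the fitness formula is an assumption made only for illustration.

```python
from typing import Any, Dict


def fit_model(config: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical stand-in for building and fitting the TensorFlow model."""
    # Placeholder values; a real run would take these from the training history
    # and from the early-stopping callback.
    return {"loss": 0.25, "stopped_epoch": 12}


def train(config: Dict[str, Any]) -> Dict[str, Any]:
    """Runs one full training job and returns a single result dict."""
    values = fit_model(config)
    # Placeholder fitness (an assumption): higher is better, bounded by 1.
    fitness = 1.0 / (1.0 + values["loss"])
    return {
        "Loss": values["loss"],
        "Fitness": fitness,
        "EarlyStop": values["stopped_epoch"],
    }


result = train({"units": 32})  # "units" is a made-up config key
```

The point of the sketch is only that each trial produces exactly one result dict, containing the "Fitness" key that tune.run is told to maximize.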
Problem:
I am running Ray on a cluster of machines where a session can last at most 24 hours. I start the Ray cluster manually with a script and then run the tuning experiment. With my current resources I can complete 200 trials within the 24-hour limit before the session is terminated. I want to run a much larger experiment, e.g., 1000 trials, so I have to break it up into five 200-trial sessions, each continuing from the previously recorded results.
The trial results are stored in ~/results/nas/ and the experiment name is exp-ur-extended. Currently that directory contains 200 subdirectories, each corresponding to one trial and its results.
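As a sanity check on that layout, the number of finished trials and the number still needed can be derived from the directory itself. This is a rough sketch using only the standard library: it assumes every immediate subdirectory of the experiment directory is one trial, and the helper names and demo directory names are made up for illustration.

```python
import os
import tempfile


def count_trial_dirs(exp_dir: str) -> int:
    """Count immediate subdirectories of exp_dir (assumed: one per trial)."""
    return sum(1 for entry in os.scandir(exp_dir) if entry.is_dir())


def remaining_trials(exp_dir: str, target_total: int) -> int:
    """Trials still needed to reach target_total, never negative."""
    return max(target_total - count_trial_dirs(exp_dir), 0)


# Demo on a throwaway directory with three fake trial folders and one file.
with tempfile.TemporaryDirectory() as demo_dir:
    for i in range(3):
        os.mkdir(os.path.join(demo_dir, f"trial_{i:05d}"))  # hypothetical names
    open(os.path.join(demo_dir, "notes.txt"), "w").close()  # files are ignored
    demo_done = count_trial_dirs(demo_dir)
    demo_left = remaining_trials(demo_dir, target_total=5)
```

For the real experiment the call would be along the lines of remaining_trials(os.path.expanduser("~/results/nas/exp-ur-extended"), 1000), which should give 800 while only the first 200 trials exist.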
Is it possible to:
- Load the state of HyperOptSearch from the files located in ~/results/nas/exp-ur-extended
- Generate trial 201 through 400 and record the new trials in ~/results/nas/exp-ur-extended
Currently the code runs, but it terminates immediately after the Ray cluster is instantiated, as if there were nothing left to do. Any suggestions on how to proceed?