Hello everyone.
I’m looking to optimize a couple of parameters of a given function using Tune.
This function uses some files and can’t be parallelized. However, I have coded the trainable in a way that allows the trials to be interleaved (using checkpoints) between file usages.
Overall, the training is working the way I expected: one trial opens a file and runs an “iteration”, then the next trial runs another “iteration” on the same file, and so on until all trials have run one “iteration” on the open file; then the second file is opened and the cycle repeats.
However, with some search algorithms the trials usually get terminated after only 1 or 2 iterations, when a full training should be about 30 iterations.
Any tips on how to get this to work? I’ll share some snippets of my code as well as my tune config below.
def trainable(config, checkpoint_dir=None):
    if checkpoint_dir:
        pass  # here I load the current open file
    # here I execute my code and produce some metrics for Ray
    with tune.checkpoint_dir(step=current_file) as checkpoint_dir:
        pass  # here I dump some variables
    yield {"metric": metric}
And here is my tune.run config:
from ray import tune
from ray.tune.schedulers import HyperBandForBOHB
from ray.tune.suggest.bohb import TuneBOHB

bohb_hyperband = HyperBandForBOHB(
    time_attr="training_iteration",
    max_t=100,
    reduction_factor=4,
    metric="metric",
    mode="max")

bohb_search = TuneBOHB(
    max_concurrent=1,
    metric="metric",
    mode="max")

return tune.run(
    trainable,
    config=...,  # 2 uniform search spaces
    name="bohb_search",
    scheduler=bohb_hyperband,
    stop={"training_iteration": 100},
    search_alg=bohb_search,
    num_samples=10,
    resources_per_trial={"cpu": 15},  # I use this to limit trial execution to 1 at a time
)
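For completeness, the search space I pass as config is just two continuous ranges, roughly like this (the parameter names and bounds here are placeholders, not my real values):

from ray import tune

search_space = {
    "param_a": tune.uniform(0.0, 1.0),  # placeholder name and bounds
    "param_b": tune.uniform(0.0, 1.0),  # placeholder name and bounds
}

That dict is what goes into the config= argument of tune.run above.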
Note that each iteration also takes some time (20 to 40 seconds), and I’m unsure whether that is affecting the training process.
Any help is appreciated!