Hi there!
I hope you’re having a great day.
I have some questions about Population Based Training with Ray, and more specifically about when it stops tuning.
For more context, here is part of my code (I'm using tune.run, which I know is the soon-to-be-deprecated API):
```python
from functools import partial

from ray import tune
from ray.tune import CLIReporter
from ray.tune.schedulers import PopulationBasedTraining

# PBT scheduler: perturb hyperparameters every training iteration,
# minimizing the validation loss.
scheduler = PopulationBasedTraining(
    hyperparam_mutations=hyperparam_mutations,
    time_attr="training_iteration",
    metric="val_loss",
    mode="min",
    perturbation_interval=1,
)

reporter = CLIReporter(
    parameter_columns=parameters_to_display,
    metric_columns=["val_loss", "epoch"],
    metric="val_loss",
    mode="min",
)

results = tune.run(
    partial(training_func),
    config=config,
    scheduler=scheduler,
    num_samples=2,
    progress_reporter=reporter,
    checkpoint_score_attr="training_iteration",
    keep_checkpoints_num=1,
    name=experience,
    checkpoint_at_end=True,
    local_dir="./ray_results",
    log_to_file=True,
    resources_per_trial={"cpu": args.cpu, "gpu": args.gpu},
    resume="AUTO",
    sync_config=tune.SyncConfig(syncer=None),  # no syncer (shared FS on SLURM)
)
```
Here, the syncer is set to None because I'm running this code on a SLURM cluster.
I'm using the function API (training_func), where I train for a fixed number of epochs X (not an infinite loop). I call session.report and checkpoint at the end of each epoch (to log info with tensorboardX).
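To make that concrete, my training function follows roughly this pattern (a simplified sketch; `build_model`, `train_one_epoch`, and `validate` are placeholders standing in for my real code):

```python
from ray.air import session
from ray.air.checkpoint import Checkpoint

def training_func(config):
    model = build_model(config)  # placeholder for my actual setup
    for epoch in range(config["max_epochs"]):  # finite loop over X epochs
        train_one_epoch(model, config)  # placeholder
        val_loss = validate(model)      # placeholder
        # Report metrics plus a checkpoint at the end of each epoch;
        # PBT relies on these to exploit/explore.
        session.report(
            {"val_loss": val_loss, "epoch": epoch},
            checkpoint=Checkpoint.from_dict(
                {"epoch": epoch, "model_state": model.state_dict()}
            ),
        )
    # When the loop ends, training_func returns and the trial is
    # marked as TERMINATED.
```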
From my understanding, because I only train for X epochs, a trial is considered done when it reaches the last epoch and training_func returns. When all trials (here, 2) are finished, the run is over and tuning stops.
Is that correct?
If so:

- Is there a way to continue tuning (maybe by resetting trials periodically) and only stop once some condition is met (with the stop option)?
- Should I simply use more trials (increase num_samples)?
- If the solution is an infinite loop inside training_func (something like while True), is there a way to prevent trials from running past a certain epoch? I need to compare results at a defined epoch (see the sketch below).
- Or is there a way to retrieve the best trial at a precise epoch?
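For the infinite-loop option, this is the kind of thing I have in mind (just a sketch, reusing the hypothetical helpers from above; per-epoch checkpointing omitted for brevity even though PBT needs it):

```python
def training_func(config):
    model = build_model(config)  # hypothetical helper, as above
    epoch = 0
    while True:  # never return; let the stopping condition end the trial
        train_one_epoch(model, config)
        val_loss = validate(model)
        session.report({"val_loss": val_loss, "epoch": epoch})
        epoch += 1

# Stop every trial once it reaches a fixed iteration count, so that all
# trials can be compared at the same epoch (50 is an arbitrary example):
results = tune.run(
    training_func,
    config=config,
    scheduler=scheduler,
    num_samples=2,
    stop={"training_iteration": 50},
)
```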
Thanks for your insight!