Question - About tune stopping condition with PBT

Hi there!
I hope you’re having a great day.

I have some questions about Population Based Training with Ray Tune, and more specifically about the moment it stops tuning.

For more context, here’s part of my code (I’m using tune.run, which I know is the soon-to-be-deprecated API):

    from ray import tune
    from ray.tune import CLIReporter
    from ray.tune.schedulers import PopulationBasedTraining

    scheduler = PopulationBasedTraining(
        hyperparam_mutations=hyperparam_mutations,
        time_attr="training_iteration",
        metric="val_loss",
        mode="min",
        perturbation_interval=1,
    )

    reporter = CLIReporter(
        parameter_columns=parameters_to_display,
        metric_columns=["val_loss", "epoch"],
        metric="val_loss",
        mode="min",
    )

    results = tune.run(
        training_func,
        config=config,
        scheduler=scheduler,
        num_samples=2,
        progress_reporter=reporter,
        checkpoint_score_attr="training_iteration",
        keep_checkpoints_num=1,
        name=experience,
        checkpoint_at_end=True,
        local_dir="./ray_results",
        log_to_file=True,
        resources_per_trial={"cpu": args.cpu, "gpu": args.gpu},
        resume="AUTO",
        sync_config=tune.SyncConfig(syncer=None),
    )

Here, the syncer is set to None because I’m running this code under the SLURM cluster manager.

I’m using a function trainable (training_func) that trains for a fixed number of epochs X (not an infinite loop). At the end of each epoch I call session.report and save a checkpoint (to log info with tensorboardX).
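
Roughly, the loop looks like this; it’s a simplified sketch assuming PyTorch and the Ray 2.x AIR session API, where build_model, train_one_epoch, and validate stand in for my actual helpers:

    import os

    import torch
    from ray.air import session
    from ray.air.checkpoint import Checkpoint

    def training_func(config):
        model = build_model(config)  # hypothetical model factory
        start_epoch = 0

        # PBT restores exploited trials from a checkpoint, so load one if present.
        ckpt = session.get_checkpoint()
        if ckpt is not None:
            with ckpt.as_directory() as ckpt_dir:
                state = torch.load(os.path.join(ckpt_dir, "state.pt"))
            model.load_state_dict(state["model"])
            start_epoch = state["epoch"] + 1

        for epoch in range(start_epoch, config["max_epochs"]):  # finite: X epochs
            train_one_epoch(model, config)      # hypothetical helper
            val_loss = validate(model, config)  # hypothetical helper

            # Checkpoint and report once per epoch; "epoch" feeds the reporter.
            os.makedirs("epoch_checkpoint", exist_ok=True)
            torch.save(
                {"model": model.state_dict(), "epoch": epoch},
                os.path.join("epoch_checkpoint", "state.pt"),
            )
            session.report(
                {"val_loss": val_loss, "epoch": epoch},
                checkpoint=Checkpoint.from_directory("epoch_checkpoint"),
            )
        # Returning here is what marks the trial as terminated.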

From my understanding, because I only train for X epochs, a trial is considered done when it reaches the last epoch and exits training_func. When all trials (here, 2) are finished, the run is over and tuning stops.
Is that correct?

If so, is there any way to keep tuning (maybe by resetting trials periodically) and only stop once some condition is met (with the stop option)? Should I use more trials (increase num_samples)? If the solution is to use an infinite loop inside training_func (something like while True), is there a way to prevent trials from going past a certain epoch (I need to compare results at a defined epoch)? Or to retrieve the best trial at a precise epoch?

Thanks for your insight!

Yes, your understanding is correct. Could you try setting X to a higher number?
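
If you want a hard cap rather than hard-coding X, you can also train in an open-ended loop (while True) and let Tune end each trial via the stop argument of tune.run. A minimal sketch, where 100 is a placeholder cap and the other arguments stay as in your snippet:

    results = tune.run(
        training_func,
        config=config,
        scheduler=scheduler,
        num_samples=2,
        stop={"training_iteration": 100},  # each trial ends after 100 reports
    )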

Thanks @xwjiang2010 for your answer!

I can’t really modify the number of training epochs, for comparison purposes (I want to compare one model to another after the same number of epochs).
I’ll try increasing my number of trials to get better results.
Or is there a way to get the best trial for a given epoch/training_iteration?

Can we decouple the tuning run from retrieving the relevant results? For tuning, you can set a big X. Then you can use this API to get metrics for all trials at every iteration, and do all the comparisons you want from there.
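
For example, tune.run returns an ExperimentAnalysis object whose trial_dataframes holds every reported result per trial, so something along these lines should work (target_epoch is a placeholder, and the column names match the metrics you report):

    target_epoch = 10  # placeholder: the epoch you want to compare at
    best_logdir, best_loss = None, float("inf")

    for logdir, df in results.trial_dataframes.items():
        rows = df[df["epoch"] == target_epoch]  # results reported at that epoch
        if not rows.empty and rows["val_loss"].min() < best_loss:
            best_loss = rows["val_loss"].min()
            best_logdir = logdir

    print(f"Best trial at epoch {target_epoch}: {best_logdir} ({best_loss:.4f})")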

Thanks again for your answer and the API link =)

@Alana This guide may also be useful to you! Analyzing Tune Experiment Results — Ray 3.0.0.dev0

Thanks! It’s indeed quite useful for comparisons =)