Questions about tune stopping condition with PBT

Hi there!
I hope you’re having a great day.

I have some questions about Population Based Training with Ray Tune, and more specifically about when it stops tuning.

For more context, here's part of my code (I'm using tune.run, which I know is the soon-to-be-deprecated API):

    scheduler = PopulationBasedTraining(
        hyperparam_mutations=hyperparam_mutations,
        time_attr="training_iteration",
        metric="val_loss",
        mode="min",
        perturbation_interval=1,
    )

    reporter = CLIReporter(
        parameter_columns=parameters_to_display,
        metric_columns=["val_loss", "epoch"],
        metric="val_loss",
        mode="min",
    )

    results = tune.run(
        partial(training_func),
        config=config,
        scheduler=scheduler,
        num_samples=2,
        progress_reporter=reporter,
        checkpoint_score_attr="training_iteration",
        keep_checkpoints_num=1,
        name=experience,
        checkpoint_at_end=True,
        local_dir="./ray_results",
        log_to_file=True,
        resources_per_trial={"cpu": args.cpu, "gpu": args.gpu},
        resume="AUTO",
        sync_config=tune.SyncConfig(syncer=None),
    )

Here, syncer is set to None because I'm running this code under the SLURM cluster manager.

I'm using the function API (training_func), where I train over X epochs (not an infinite loop). I call session.report with a checkpoint at the end of each epoch (to log info via tensorboardX).
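To make the shape of that function concrete, here's a toy version of what I mean. It's plain Python with no Ray imports: `report` is just my stand-in for session.report, and `train_one_epoch` / `num_epochs` are placeholder names, not real API:

```python
reported = []

def report(metrics):
    # stand-in for session.report(metrics) + checkpointing at each epoch
    reported.append(metrics)

def train_one_epoch(epoch):
    # placeholder: pretend the validation loss improves each epoch
    return 1.0 / (epoch + 1)

def training_func(config):
    # fixed-range loop over X epochs, not `while True`
    for epoch in range(config["num_epochs"]):
        val_loss = train_one_epoch(epoch)
        report({"val_loss": val_loss, "epoch": epoch})
    # returning here is, as far as I understand, what marks the trial as done

training_func({"num_epochs": 3})
print(len(reported))  # prints 3
```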

From my understanding, because I only train over X epochs, a trial is considered done when it reaches the last epoch and exits training_func. When all trials (here, 2) are finished, the run is complete and tuning stops.
Is that correct?

If so, is there any way to continue tuning (maybe by resetting trials periodically) and only stop after some conditions are met (via the stop option)? Should I use more trials (increase num_samples)? If the solution is to use an infinite loop inside training_func (something like while True), is there a way to prevent trials from going past a certain epoch (I need to compare results at a defined epoch)? Or to retrieve the best trial at a precise epoch?
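To illustrate what I'm asking about with the while True option, here's a toy simulation (again no Ray imports; `run_trial` is just my mental model of how Tune counts one training_iteration per report and applies something like stop={"training_iteration": N}):

```python
def run_trial(training_iter, stop_iteration=None):
    # drive a generator-style trial, counting one training_iteration per
    # report, and break once the stop condition is hit
    iteration = 0
    for _metrics in training_iter():
        iteration += 1
        if stop_iteration is not None and iteration >= stop_iteration:
            break
    return iteration

def fixed_epochs():
    # fixed-epoch loop: exits on its own after 5 reports -> trial done
    for epoch in range(5):
        yield {"epoch": epoch}

def endless():
    # `while True` loop: never exits by itself; needs an external stop
    epoch = 0
    while True:
        yield {"epoch": epoch}
        epoch += 1

print(run_trial(fixed_epochs))               # prints 5
print(run_trial(endless, stop_iteration=8))  # prints 8
```

If that model is right, the stop option would let an open-ended loop end at a chosen epoch, which is what I'd need to compare results at a defined epoch.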

Thanks for your insight!

Please ignore this post. It's a duplicate of Question - About tune stopping condition with PBT