It seems to me that restoring a failed run that uses a scheduler does not work as expected when done with something like this:
tuner = tune.Tuner.restore("…")
results = tuner.fit()
When num_samples is larger than the number of trials the available resources can run at once, the previously paused trials don't restart when it should be their turn, nor does the scheduler (here PB2) seem to be reactivated.
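To make the setup concrete, here is a minimal sketch of the situation described above. The helper names (`max_concurrent_trials`, `trials_waiting`, `restore_and_resume`) are hypothetical and only for illustration. The `trainable` and `resume_unfinished` arguments to `Tuner.restore` are assumptions about the installed Ray version: recent Ray 2.x releases accept them, while older ones may not take `trainable`. The experiment path and the trainable itself are left to the caller.

```python
def max_concurrent_trials(total_units: float, units_per_trial: float) -> int:
    """Number of trials that fit at once for one resource type (e.g. CPUs)."""
    if units_per_trial <= 0:
        raise ValueError("units_per_trial must be positive")
    return int(total_units // units_per_trial)


def trials_waiting(num_samples: int, total_units: float, units_per_trial: float) -> int:
    """Trials that cannot run immediately and have to wait their turn."""
    return max(0, num_samples - max_concurrent_trials(total_units, units_per_trial))


def restore_and_resume(experiment_path: str, trainable):
    """Restore an interrupted Tune experiment and continue fitting it."""
    # Imported lazily so the helpers above can be used without Ray installed.
    from ray import tune

    tuner = tune.Tuner.restore(
        experiment_path,
        trainable=trainable,     # assumption: accepted/required by recent Ray 2.x
        resume_unfinished=True,  # continue trials that had not finished
    )
    return tuner.fit()
```

For example, with num_samples=8 on a machine with 4 CPUs and 1 CPU per trial, `trials_waiting(8, 4, 1)` gives 4, so half of the trials are waiting or paused at any time. Those are the trials that would be expected to get their turn again after `restore_and_resume(...)` is called.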
Output of initial fit():
Above you can see the PBT algorithm (here PB2) running with checkpoints and perturbations.
However, restoring a failed run (stopped by pressing Ctrl+C) does not seem to work; the output looks like this:
Only the trial that was running before is restarted, not the ones that were paused when the “fail” occurred, even though far more time steps were performed beyond the point the paused trials had reached.
Sample code for running this can be found here