PB2 seems stuck at the search-space bounds and raises exceptions with lambdas

As per the title, both @Peter_Pirog and I have encountered this bug where PB2 remains fixed at the minimum and maximum values of the given hyperparameter space. When providing a lambda, it crashes, saying that the value must be a tuple of length 2.

cc @amogkam who helped make the PB2 integration happen

Perfect! I will prepare an example on Colab to easily reproduce this peculiar behaviour.

Here is the Colab reproducing the aforementioned error: https://colab.research.google.com/drive/1rnSJr2r-hCuDHTyeqOS9fM9J4kF4r7PA?usp=sharing

Hi @LucaCappelletti94, thanks for sharing the Colab notebook.

Since no config parameter was passed to tune.run, the hyperparam_bounds are used as the initial search space, so the x value for every trial will be either 1 or 100. And the reason these values never change is that no perturbations are actually occurring.

This is because each trial only runs a single iteration. Instead, you need to add a loop to your loss function so that multiple training iterations are run:

from ray import tune

def loss(config):
    # Report once per iteration so the scheduler sees multiple results per trial.
    for _ in range(100):
        tune.report(loss=config.get("x")**2)

Then you also have to reduce the value of your perturbation_interval. As it’s set up now, no perturbations occur until a trial has run for 1000 seconds. You should either lower this number or change the setup to use iterations as the interval instead:

time_attr='training_iteration',
perturbation_interval=2,
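
To be explicit about where those arguments go: they belong to the PB2 scheduler object, which is then passed to tune.run via scheduler=. A rough sketch, assuming PB2 is imported from ray.tune.schedulers; the bounds here are just illustrative:

from ray.tune.schedulers import PB2

pb2 = PB2(
    time_attr="training_iteration",   # measure the interval in reported iterations
    perturbation_interval=2,          # consider a perturbation every 2 iterations
    hyperparam_bounds={"x": [-100, 100]})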

Ok I see, is there an example of how to use this perturbation on something like a Sklearn or Keras model? I am trying to picture which of the models I am usually familiar with might benefit from this particular optimization approach.

Thanks!

I have tried the following, but it still keeps jumping back and forth between -100 and 100. What should I change?

@LucaCappelletti94 ah yes, for PB2 you also need to specify an initial hyperparameter search space by passing in a config to tune.run, perhaps something like:

tune.run(
    loss,
    ...,
    config={
        'x': tune.uniform(-100, 100)
    })

If you don’t specify this, then the initial hyperparameter values for all trials will be either -100 or 100, and since PB2 uses the information from previous trials to inform future hyperparameters, all future trials will be at these bound values as well. Try this out, and let me know if it works for you!
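
Putting the pieces together, a minimal sketch of the full call could look like the following (num_samples and the metric/mode settings are arbitrary choices for illustration): the config gives each trial a random starting value, while hyperparam_bounds constrains how PB2 perturbs it afterwards.

from ray import tune
from ray.tune.schedulers import PB2

tune.run(
    loss,
    # Initial values are sampled from this distribution, one per trial.
    config={"x": tune.uniform(-100, 100)},
    # PB2 keeps any perturbed values inside these bounds.
    scheduler=PB2(
        time_attr="training_iteration",
        perturbation_interval=2,
        hyperparam_bounds={"x": [-100, 100]}),
    metric="loss",
    mode="min",
    num_samples=4)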

Yes, you can use these algorithms with sklearn and Keras. Tune will work with anything that can be specified in a training function. You can see an example with Keras here: tune_mnist_keras — Ray v2.0.0.dev0

Ok I see, I will try this ASAP.

Regarding the Keras model, I fail to see how the perturbation would be applied across the various training epochs. How would the PB2 method work there?

In the example, we add a TuneReportCallback to model.fit inside our training function. This will automatically report results to Tune after every training epoch. With this, we can use Keras with Tune and you can pass in any scheduler when you call tune.run, whether that’s PB2, PBT, ASHA, etc. These schedulers will work just like with any other training function.
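
As a rough sketch of what that looks like (modeled on the linked example; build_model, x_train and y_train are placeholders, and the import path matches the Ray version linked above):

from ray.tune.integration.keras import TuneReportCallback

def train_mnist(config):
    model = build_model(config)   # placeholder: returns a compiled Keras model
    model.fit(
        x_train, y_train,         # placeholders for your training data
        epochs=10,
        # Reports the Keras "accuracy" metric to Tune as "mean_accuracy" after
        # every epoch, so the scheduler (PB2, PBT, ASHA, ...) can act between epochs.
        callbacks=[TuneReportCallback({"mean_accuracy": "accuracy"})])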

I understand how the TuneReportCallback works; what I am not understanding is how PB2 would update the parameters of the model as the epochs proceed. Would this be like ASHA? Is PB2 also a median-based early-stopping mechanism?

PB2 is very similar to Population Based Training; it’s not an early-stopping algorithm. Multiple trials are run in parallel, and at a certain interval the bottom percentile of trials copy the hyperparameters and state (i.e. model weights) of the top percentile, slightly perturb the hyperparameters, and then continue training until the stopping condition is met.

The difference is that PBT uses a heuristic-based approach to perturb the hyperparameters, while PB2 uses a Bayesian Gaussian Process model that leverages previous results to perturb the hyperparameters once they are copied from the top trials.
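
At the API level, the difference shows up in how the search ranges are specified: PBT takes hyperparam_mutations, while PB2 takes hyperparam_bounds. A rough sketch, with a learning-rate parameter assumed for illustration:

from ray import tune
from ray.tune.schedulers import PopulationBasedTraining, PB2

# PBT: heuristic perturbations, e.g. resampling from the given distribution
# or scaling the current value up or down.
pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    perturbation_interval=2,
    hyperparam_mutations={"lr": tune.loguniform(1e-4, 1e-1)})

# PB2: new values are suggested by its Gaussian Process model, constrained
# to stay inside these bounds.
pb2 = PB2(
    time_attr="training_iteration",
    perturbation_interval=2,
    hyperparam_bounds={"lr": [1e-4, 1e-1]})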

Population Based Training is generally a good HPO approach and has been shown to perform well across a variety of domains (image processing, NLP/transformer models, RL), and it is also more resource-efficient than other approaches.

PB2 is particularly beneficial for cases where you have to use a smaller population size, perhaps due to resource constraints.

Does this help answer your question?

For more info you can also check out:
The PB2 documentation
PB2 Blog Post

And for more in-depth info for context on standard Population Based Training and the motivation on why we need PB2:
This Anyscale Connect Talk

I don’t understand how something like the number of layers of a model can be tuned if the weights are kept, since a different number of layers would imply a different shape for the weights. Also, how are the weights kept? Is the checkpointing system automatic?

Your trainable function will have to take care of checkpoint saving and restoring, but that’s no different to other trainables.
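
As a rough sketch of what that could look like with the function-trainable checkpoint API (the "model" here is just a single number, config["x"] and config["lr"] are assumed parameters, and the exact checkpoint API has changed between Ray versions, so treat this purely as illustrative):

import json
import os

from ray import tune

def train(config, checkpoint_dir=None):
    # Toy "model": a single weight that we shrink towards zero.
    weight, start = config["x"], 0
    # If Tune hands us a checkpoint (e.g. after a PB2/PBT exploit step),
    # restore the copied trial's state from it.
    if checkpoint_dir:
        with open(os.path.join(checkpoint_dir, "state.json")) as f:
            state = json.load(f)
        weight, start = state["weight"], state["step"]
    for step in range(start, 100):
        weight *= 1 - config["lr"]  # stand-in for one training step
        # Periodically save state so a struggling trial can copy it later.
        with tune.checkpoint_dir(step=step) as cdir:
            with open(os.path.join(cdir, "state.json"), "w") as f:
                json.dump({"weight": weight, "step": step}, f)
        tune.report(loss=weight ** 2)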

As for the number of layers, this is no problem as long as the number of layers remains constant within a trial (i.e. it can’t be mutated). So let’s say you have 4 trials A, B, C and D. A to C use 2 layers each and D uses 3 layers. Trial A performs badly and should be stopped. Trial D currently performs best. Then Trial A will copy all hyperparameters from Trial D (including the number of layers, 3) and restore from the latest checkpoint of D (which has weights for all three layers). It will also perturb some of the hyperparameters, e.g. the learning rate, and then continue training.
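
To make that concrete, one way such a setup could look (assuming the behaviour described above, where the full config is copied on exploit but only the bounded parameters are perturbed; train, the metric and num_samples are placeholders):

from ray import tune
from ray.tune.schedulers import PB2

tune.run(
    train,
    # num_layers differs between trials and is copied wholesale on exploit,
    # but it is never perturbed because it has no entry in hyperparam_bounds.
    config={
        "num_layers": tune.choice([2, 3]),
        "lr": tune.loguniform(1e-4, 1e-1)},
    scheduler=PB2(
        time_attr="training_iteration",
        perturbation_interval=2,
        hyperparam_bounds={"lr": [1e-4, 1e-1]}),  # only lr is perturbed
    metric="loss",
    mode="min",
    num_samples=4)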

So in a way it does stop badly performing trials early, but the resources are then used to continue training a modified copy of a well-performing trial. Does this help?