When to use ASHA

ASHA is a scheduler that stops training tasks that have sub-optimal performance when compared to the other available algorithms.

When should we use ASHA? Is there a particular subset of the available algorithms that work well with ASHA and others that do not? Should we use other schedulers in those cases?

Thanks!

@LucaCappelletti94 There is an article with a comparison of the ASHA, PBT and PB2 schedulers:
Provably Efficient Online Hyperparameter Optimization with Population-Based Bandits

The comparison is in the appendix.
I use ASHA when the hyperparameter space is not very big but training is very long, and I want to terminate ineffective trials as soon as possible.

Thank you @Peter_Pirog!
What would be your definition of big hyper-parameters space?
What is your usual ASHA parameterization?
I was told by @rliaw to avoid ASHA when using HyperOpt, do you have any opinion on this?

@LucaCappelletti94, of course the definition of “big search space” is very relative.
Typically I use ASHA when I want to check every hyperparameter combination, which is possible but takes a lot of time.
For example, in supervised learning I want to check a Keras model with hundreds of combinations of dense-layer units, initializers, activations, dropout values, etc., defined with ray.tune.grid_search. ASHA can check all these combinations and terminate the bad ones as soon as possible; with reduction_factor=3, for example, trials can be terminated after 3, 9, 27, 81, etc. iterations.
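As a rough illustration of where termination points such as 3, 9, 27, 81 come from, here is a minimal sketch. It assumes rungs sit at grace_period * reduction_factor**k up to max_t. The helper name `asha_milestones` is hypothetical and not part of Ray Tune, and the exact rung rule inside the library may differ at the boundaries.

```python
def asha_milestones(grace_period, reduction_factor, max_t):
    """Iterations at which trials are compared and possibly stopped.

    Assumed rule (illustration only): grace_period * reduction_factor**k,
    for every k that keeps the milestone at or below max_t.
    """
    if reduction_factor <= 1:
        raise ValueError("reduction_factor must be greater than 1")
    milestones = []
    t = grace_period
    while t <= max_t:
        milestones.append(t)
        t *= reduction_factor
    return milestones


print(asha_milestones(grace_period=3, reduction_factor=3, max_t=100))
# [3, 9, 27, 81]
print(asha_milestones(grace_period=1, reduction_factor=3, max_t=100))
# [1, 3, 9, 27, 81]
```

Under this assumption, a larger grace_period simply shifts the first comparison point later, which gives slow-starting trials more time before they can be cut.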

ASHA doesn’t transfer knowledge about hyperparameters from one trial to another: a trial is either good and continues, or bad and is terminated.
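The continue-or-terminate decision can be sketched as a simple ranking test at one rung. This is a simplified illustration, assuming higher scores are better and that a trial survives only if it ranks in the top 1/reduction_factor of the scores seen at that rung. `should_stop` is a hypothetical helper, not Ray Tune's implementation.

```python
def should_stop(score, rung_scores, reduction_factor=3):
    """Return True if `score` falls outside the top 1/reduction_factor.

    `rung_scores` holds the scores other trials reported at the same rung.
    Only these scores are used: nothing about the hyperparameters themselves
    is shared between trials.
    """
    ranked = sorted(list(rung_scores) + [score], reverse=True)
    keep = max(1, len(ranked) // reduction_factor)  # how many trials survive
    cutoff = ranked[keep - 1]                       # worst surviving score
    return score < cutoff


others = [0.9, 0.8, 0.7, 0.6, 0.5]
print(should_stop(0.95, others))  # False: ranks first, keeps training
print(should_stop(0.40, others))  # True: ranks last, is terminated
```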

In PB2, knowledge about good hyperparameters in one trial is transferred to the others, so I think it’s a good fit for ranges such as ray.tune.uniform and ray.tune.randint.

This is only my experience with schedulers (Ray is new to me and I am still trying to understand it), but ASHA works fine in these circumstances. I think @rliaw is more experienced than me, so don’t ignore his hints :slightly_smiling_face:

Thanks! If I understood correctly, PB2 is in itself basically also a tuning algorithm and it does not play along with other algorithms. Is this correct?

Yes, ASHA is one of the tuning algorithms, so you can choose only one of them.
https://docs.ray.io/en/master/tune/api_docs/schedulers.html?highlight=ASHA

Sorry, do you mean PB2 or ASHA as tuning algorithm? If I understood the schedulers correctly, ASHA is just an early stopping mechanism and does not provide any tuning by itself. PB2 instead, by introducing a mutation mechanism, does indeed provide tuning.

Have I misunderstood?

I use PB2 or ASHA; terminating ineffective combinations is a kind of tuning. In the arXiv paper https://arxiv.org/pdf/2002.02518.pdf, ASHA, PBT and PB2 are compared as alternative methods of hyperparameter tuning.

I see. In these scenarios, when using ASHA, are the hyper-parameter values just randomly sampled?

As far as I have observed, both algorithms choose the parameters randomly at the beginning, but during training PB2 uses Bayesian conditional probabilities to select the next set of parameters, so PB2 uses experience from all previous trials.

There is a nice tool in TensorBoard for observing the influence of hyperparameters.

Thank you! I will ask another question on how to model integer values (eg units in a layer) using PB2 in another form.

Just a last question on ASHA: about the grace period, which values have you found work best? With what considerations?

I understand I am asking your opinion and the optimal value of these parameters needs to be properly tuned depending on the task.

I am now changing my Keras training script from ASHA to PB2. I will paste the GitHub link when I finish.


Thank you, I would love to see a good example of it! I’ve tried to use it but I must be doing something wrong, the hyper-parameters selected seem to be chosen exclusively from the space minimum or maximum.

Now I have noticed the same problem: only maximum or minimum values.
In the GitHub code there is this example:

Example:
>>> pb2 = PB2(
>>>     time_attr="timesteps_total",
>>>     metric="episode_reward_mean",
>>>     mode="max",
>>>     perturbation_interval=10000,
>>>     hyperparam_mutations={
>>>         # These must be continuous, currently a limitation.
>>>         "factor_1": lambda: random.uniform(0.0, 20.0),
>>>     })

but using a lambda raises an error:

ValueError: hyperparam_bounds values must either be a list or tuple of size 2, but got <function lambda at 0x7fd426a4ce18> instead
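The error message suggests that each value must be a [min, max] pair rather than a callable. Below is a minimal sketch of that rule, assuming the resulting dict is what gets passed as `hyperparam_bounds`. `check_bounds` is a hypothetical helper written for illustration and is not Ray's own validation code.

```python
def check_bounds(hyperparam_bounds):
    """Accept only {name: [low, high]} style entries, mirroring the error text."""
    for name, bounds in hyperparam_bounds.items():
        if not isinstance(bounds, (list, tuple)) or len(bounds) != 2:
            raise ValueError(
                "hyperparam_bounds values must either be a list or tuple "
                f"of size 2, but got {bounds!r} instead (key: {name})"
            )
    return hyperparam_bounds


# A [low, high] pair passes the check.
bounds = check_bounds({"factor_1": [0.0, 20.0]})
print(bounds)

# A sampling function, as in the quoted example, is rejected.
try:
    check_bounds({"factor_1": lambda: 0.0})
except ValueError as err:
    print("rejected:", err)
```

If this reading is right, replacing the lambda with a plain pair such as `[0.0, 20.0]` should avoid the ValueError, although it would not by itself explain why only the minimum or maximum values get selected.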

Yes, I am facing the very same issue. Possibly we should move this topic to another question to try and keep them atomic.

May I ask your opinion on my question just above on the grace period in ASHA?

Grace period is similar to patience in Keras callbacks: it sets how long a trial is allowed to run before ASHA may stop it.

I found some suggestions on how to set it:
https://stackoverflow.com/questions/43906048/which-parameters-should-be-used-for-early-stopping

The patience argument represents the number of epochs before stopping once your loss starts to increase (stops improving). This depends on your implementation: if you use very small batches or a large learning rate, your loss will zig-zag (accuracy will be noisier), so it is better to set a large patience argument. If you use large batches and a small learning rate, your loss will be smoother, so you can use a smaller patience argument. Either way I’ll leave it as 2 so I would give the model more chance.
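To make the patience idea concrete, here is a small sketch of patience-based stopping on a list of per-epoch losses. It is a simplification for illustration, with a hypothetical function name `stop_epoch`, and it is not the Keras callback itself.

```python
def stop_epoch(losses, patience=2):
    """Return the epoch index at which training would stop, or None.

    Training stops once the loss has failed to improve on its best value
    for `patience` consecutive epochs.
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(losses):
        if loss < best:
            best = loss
            waited = 0          # improvement resets the counter
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return None


noisy = [1.0, 0.8, 0.9, 0.85, 0.7]
print(stop_epoch(noisy, patience=2))  # 3: stops before the later improvement
print(stop_epoch(noisy, patience=3))  # None: survives long enough to reach 0.7
```

The noisy loss curve shows the trade-off described in the quote: a small patience stops the run just before it would have improved again.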

@rliaw I am using tune.run() with ASHA as the trial scheduler for early stopping, and HyperOpt as the searcher, as follows:

algo = HyperOptSearch(points_to_evaluate=current_best_params, random_state_seed=self.seed)
algo = ConcurrencyLimiter(algo, max_concurrent=5)
analysis = tune.run(
    HpoTrainable,
    name="hpo",
    metric="loss",
    mode="min",
    search_alg=algo,
    scheduler=AsyncHyperBandScheduler(),
    num_samples=10,
    config=search_space,
    stop={"training_iteration": self.training_iteration},
)

I find that it is very slow (about three times slower than without ASHA). Is it true that ASHA + HyperOpt is not a good combination?