ASHA is a scheduler that stops training trials that show sub-optimal performance compared to the other running trials.
When should we use ASHA? Is there a particular subset of the available search algorithms that works well with ASHA, and others that do not? Should we use other schedulers in those cases?
The comparison is in the appendix.
I use ASHA when the hyperparameter space is not very big, but training is very long and I want to terminate ineffective trials as soon as possible.
Thank you @Peter_Pirog!
What would be your definition of a big hyperparameter space?
What is your usual ASHA parameterization?
I was told by @rliaw to avoid ASHA when using HyperOpt; do you have any opinion on this?
@LucaCappelletti94, of course the definition of “big search space” is very relative.
Typically I use ASHA when I want to check all hyperparameter combinations; it is possible, but it takes a lot of time.
For example, in supervised learning I may want to check a Keras model with hundreds of combinations of dense layer units, initializers, activations, dropout values, etc., defined by ray.tune.grid_search. ASHA can check all these combinations and terminate the bad ones as soon as possible; with the default settings (reduction_factor=3), terminations happen after 3, 9, 27, 81, etc. iterations. Something like the sketch below.
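Here is a minimal sketch of what I mean (the trainable and the search space are just placeholders, not a real model):

```python
from ray import tune
from ray.tune.schedulers import ASHAScheduler

def trainable(config):
    acc = 0.0
    for step in range(100):
        # stand-in for one epoch of real training
        acc += config["dropout"] * 0.01
        tune.report(mean_accuracy=acc)

asha = ASHAScheduler(
    metric="mean_accuracy",
    mode="max",
    max_t=100,           # longest a trial may run, in reported iterations
    grace_period=1,      # every trial gets at least one iteration
    reduction_factor=3,  # promotion rungs at 1, 3, 9, 27, 81 iterations
)

tune.run(
    trainable,
    scheduler=asha,
    config={
        "units": tune.grid_search([32, 64, 128]),
        "dropout": tune.grid_search([0.1, 0.3, 0.5]),
    },
)
```

At every rung, only the best fraction (1/reduction_factor) of trials is promoted; the rest are stopped.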
ASHA doesn’t transfer knowledge about hyperparameters from one trial to another: a trial is either good and gets continued, or bad and gets terminated.
In PB2, knowledge about good hyperparameters in one trial is transferred to the others, so I think it is good for ranges like ray.tune.uniform and ray.tune.randint, as in the sketch below.
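A minimal PB2 sketch in the same placeholder style (the bounds and intervals are illustrative; a real run also needs checkpointing so that weights can be cloned between trials):

```python
from ray import tune
from ray.tune.schedulers.pb2 import PB2  # requires GPy and scikit-learn

def trainable(config):
    acc = 0.0
    for step in range(100):
        acc += config["lr"]  # stand-in for real training progress
        tune.report(mean_accuracy=acc)

pb2 = PB2(
    time_attr="training_iteration",
    metric="mean_accuracy",
    mode="max",
    perturbation_interval=5,      # re-select hyperparameters every 5 iterations
    hyperparam_bounds={
        "lr": [1e-4, 1e-1],       # matches the tune.uniform range below
        "batch_size": [16, 128],  # matches the tune.randint range below; cast
                                  # to int inside the trainable in real use
    },
)

tune.run(
    trainable,
    scheduler=pb2,
    num_samples=8,                # population size
    config={
        "lr": tune.uniform(1e-4, 1e-1),
        "batch_size": tune.randint(16, 128),
    },
)
```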
This is only my experience with schedulers (Ray is new to me and I am still trying to understand it), but ASHA works fine in these circumstances. I think @rliaw is more experienced than me, so don't ignore his hints.
Thanks! If I understood correctly, PB2 is basically a tuning algorithm in itself, and it does not combine with other search algorithms. Is this correct?
Sorry, do you mean PB2 or ASHA as the tuning algorithm? If I understood the schedulers correctly, ASHA is just an early-stopping mechanism and does not provide any tuning by itself. PB2, instead, by introducing a mutation mechanism, does indeed provide tuning.
I use PB2 or ASHA; terminating ineffective combinations is a kind of tuning. In the paper https://arxiv.org/pdf/2002.02518.pdf, ASHA, PBT, and PB2 are compared as alternative methods of hyperparameter tuning.
As I observed, for both algorithms the parameters are chosen randomly at the beginning, but during training PB2 uses a Bayesian (Gaussian-process) model to select the next set of parameters, so PB2 uses the experience from all previous trials.
There is a nice tool in TensorBoard, the HParams dashboard, for observing the influence of the hyperparameters.
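In case it helps, a minimal sketch of logging runs to the HParams dashboard (the run directory, names, and values are illustrative):

```python
import tensorflow as tf
from tensorboard.plugins.hparams import api as hp

def log_run(run_dir, hparams, final_accuracy):
    # one file writer per run; TensorBoard groups the runs in the HParams tab
    with tf.summary.create_file_writer(run_dir).as_default():
        hp.hparams(hparams)  # record this run's hyperparameter values
        tf.summary.scalar("accuracy", final_accuracy, step=1)

log_run(
    "logs/hparam_tuning/run-0",
    {"units": 64, "dropout": 0.2, "activation": "relu"},
    final_accuracy=0.91,
)
```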
Thank you, I would love to see a good example of it! I've tried to use it, but I must be doing something wrong: the selected hyperparameters seem to be chosen exclusively from the minimum or maximum of the space.
The patience argument represents the number of epochs to wait before stopping once your loss stops improving. This depends on your implementation: if you use very small batches or a large learning rate, your loss will zig-zag (accuracy will be noisier), so it is better to set a larger patience. If you use large batches and a small learning rate, your loss will be smoother, so you can use a smaller patience. Either way, I would leave it at 2 to give the model more of a chance.
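For reference, a minimal Keras sketch of the patience argument (the model and data are random placeholders):

```python
import numpy as np
import tensorflow as tf

# toy data standing in for a real dataset
x = np.random.rand(256, 8).astype("float32")
y = np.random.rand(256, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",         # watch the validation loss
    patience=2,                 # tolerate 2 epochs without improvement
    restore_best_weights=True,  # roll back to the best epoch when stopping
)

model.fit(x, y, validation_split=0.2, epochs=100, callbacks=[early_stop])
```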