ASHA is a scheduler that stops training trials that show sub-optimal performance compared to the other running trials.
When should we use ASHA? Is there a particular subset of the available search algorithms that works well with ASHA, and others that do not? Should we use other schedulers in those cases?
The comparison is in the appendix.
I use ASHA when the hyperparameter space is not very big, but training is very long and I want to terminate ineffective trials as soon as possible.
Thank you @Peter_Pirog!
What would be your definition of a big hyperparameter space?
What is your usual ASHA parameterization?
I was told by @rliaw to avoid ASHA when using HyperOpt; do you have any opinion on this?
@LucaCappelletti94, of course the definition of “big search space” is very relative.
Typically I use ASHA when I want to check all hyperparameter combinations; it is possible, but it takes a lot of time.
For example, in supervised learning I may want to check a Keras model with hundreds of combinations of dense layer units, initializers, activations, dropout values, etc., defined by ray.tune.grid_search. ASHA can check all these combinations and terminate the bad ones as soon as possible; with the default settings (reduction_factor=3), terminations happen after 3, 9, 27, 81, etc. iterations. Something like the sketch below.
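Here is a minimal sketch of what I mean (the trainable and the search space are just placeholders, not a real model):

```python
from ray import tune
from ray.tune.schedulers import ASHAScheduler

def trainable(config):
    acc = 0.0
    for step in range(100):
        # stand-in for one epoch of real training
        acc += config["dropout"] * 0.01
        tune.report(mean_accuracy=acc)

asha = ASHAScheduler(
    metric="mean_accuracy",
    mode="max",
    max_t=100,           # longest a trial may run, in reported iterations
    grace_period=1,      # every trial gets at least one iteration
    reduction_factor=3,  # promotion rungs at 1, 3, 9, 27, 81 iterations
)

tune.run(
    trainable,
    scheduler=asha,
    config={
        "units": tune.grid_search([32, 64, 128]),
        "dropout": tune.grid_search([0.1, 0.3, 0.5]),
    },
)
```

At every rung, only the best fraction (1/reduction_factor) of trials is promoted; the rest are stopped.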
ASHA doesn’t transfer knowledge about hyperparameters from one trial to another: a trial is either good and gets continued, or bad and gets terminated.
In PB2, knowledge about good hyperparameters in one trial is transferred to the others, so I think it is good for ranges like ray.tune.uniform and ray.tune.randint, as in the sketch below.
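A minimal PB2 sketch in the same placeholder style (the bounds and intervals are illustrative; a real run also needs checkpointing so that weights can be cloned between trials):

```python
from ray import tune
from ray.tune.schedulers.pb2 import PB2  # requires GPy and scikit-learn

def trainable(config):
    acc = 0.0
    for step in range(100):
        acc += config["lr"]  # stand-in for real training progress
        tune.report(mean_accuracy=acc)

pb2 = PB2(
    time_attr="training_iteration",
    metric="mean_accuracy",
    mode="max",
    perturbation_interval=5,      # re-select hyperparameters every 5 iterations
    hyperparam_bounds={
        "lr": [1e-4, 1e-1],       # matches the tune.uniform range below
        "batch_size": [16, 128],  # matches the tune.randint range below; cast
                                  # to int inside the trainable in real use
    },
)

tune.run(
    trainable,
    scheduler=pb2,
    num_samples=8,                # population size
    config={
        "lr": tune.uniform(1e-4, 1e-1),
        "batch_size": tune.randint(16, 128),
    },
)
```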
This is only my experience with schedulers (Ray is new to me and I am still trying to understand it), but ASHA works fine in these circumstances. I think @rliaw is more experienced than me, so don't ignore his hints.
Thanks! If I understood correctly, PB2 is basically a tuning algorithm in itself, and it does not combine with other search algorithms. Is this correct?
Sorry, do you mean PB2 or ASHA as the tuning algorithm? If I understood the schedulers correctly, ASHA is just an early-stopping mechanism and does not provide any tuning by itself. PB2, instead, by introducing a mutation mechanism, does indeed provide tuning.
I use PB2 or ASHA; terminating ineffective combinations is a kind of tuning. In the paper https://arxiv.org/pdf/2002.02518.pdf, ASHA, PBT, and PB2 are compared as alternative methods of hyperparameter tuning.
As I observed, for both algorithms the parameters are chosen randomly at the beginning, but during training PB2 uses a Bayesian (Gaussian-process) model to select the next set of parameters, so PB2 uses the experience from all previous trials.
There is a nice tool in TensorBoard, the HParams dashboard, for observing the influence of the hyperparameters.
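In case it helps, a minimal sketch of logging runs to the HParams dashboard (the run directory, names, and values are illustrative):

```python
import tensorflow as tf
from tensorboard.plugins.hparams import api as hp

def log_run(run_dir, hparams, final_accuracy):
    # one file writer per run; TensorBoard groups the runs in the HParams tab
    with tf.summary.create_file_writer(run_dir).as_default():
        hp.hparams(hparams)  # record this run's hyperparameter values
        tf.summary.scalar("accuracy", final_accuracy, step=1)

log_run(
    "logs/hparam_tuning/run-0",
    {"units": 64, "dropout": 0.2, "activation": "relu"},
    final_accuracy=0.91,
)
```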
Thank you, I would love to see a good example of it! I've tried to use it, but I must be doing something wrong: the selected hyperparameters seem to be chosen exclusively from the minimum or maximum of the space.
The patience argument represents the number of epochs to wait before stopping once your loss stops improving. This depends on your implementation: if you use very small batches or a large learning rate, your loss will zig-zag (accuracy will be noisier), so it is better to set a larger patience. If you use large batches and a small learning rate, your loss will be smoother, so you can use a smaller patience. Either way, I would leave it at 2 to give the model more of a chance.
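For reference, a minimal Keras sketch of the patience argument (the model and data are random placeholders):

```python
import numpy as np
import tensorflow as tf

# toy data standing in for a real dataset
x = np.random.rand(256, 8).astype("float32")
y = np.random.rand(256, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",         # watch the validation loss
    patience=2,                 # tolerate 2 epochs without improvement
    restore_best_weights=True,  # roll back to the best epoch when stopping
)

model.fit(x, y, validation_split=0.2, epochs=100, callbacks=[early_stop])
```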