How does early termination and trial quality evaluation work?

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes significant difficulty to completing my task, but I can work around it.

I made this extremely simple setup to illustrate the question:

from ray import tune
from ray.tune.schedulers import ASHAScheduler
n = 9
func_vals = {
    # score for a = 0 at each "reporting point" of "training"
    0: [1, 2, 3, 4, 10, 4, 3, 2, 1,],
    # score for a = 1 at each "reporting point" of "training"
    1: [1, 2, 3, 4, 3,  2, 1, 0, 1,],
}

def objective(x, a):
    # a should be 0 or 1
    if x >= 9:
        return 0
    else:
        return func_vals[a][x]


space = {
    'a': tune.choice([0,1])
}

def trainable(config):
    # config (dict): A dict of hyperparameters.

    for x in range(n):
        score = objective(x, config["a"])
        tune.report(score=score)  # This sends the score to Tune.

tune_scheduler = ASHAScheduler(
    max_t=10,
    grace_period=5,
    reduction_factor=2,
)

analysis = tune.run(
    trainable,
    metric='score',
    mode='max',
    config=space,
    num_samples=4,
    scheduler=tune_scheduler,
)

max_score = analysis.best_result['score']
best_a = analysis.best_result['config']['a']
print(f'best score was {max_score} with a = {best_a}')

With hyperparameter a=0, the best achieved score is 10 at reporting point 5, whereas with a=1 the best is 4 at reporting point 4.
This is intended to simulate the behavior of a training loop where the “score” could be, say, val_auc, which may improve up to a point and then decline.

In the above code, Ray/Tune returns the best hyperparameter as a=1, with max_score=3.
I was very surprised to see this. Is there an explanation of the mechanism behind this?

Two things that would be helpful to understand are:

  • How does early termination work?
  • How does Ray/tune evaluate the quality of a trial – is it based on the last reported score, or best so far?

Thanks for any clarifications.

Hi @pchalasani, I think there are a few things to clarify here.

First, I would suggest using tune.grid_search([0, 1]) instead of tune.choice([0, 1]). With choice you get a random selection, so all trials could end up with a=0! (This actually happened when I ran your script.) If you do this, set num_samples=2 to run 4 trials (2 repetitions of the full grid search).
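For reference, the change would look roughly like this (the rest of the script stays the same):

space = {
    'a': tune.grid_search([0, 1])   # exhaustively cover both values of a
}

analysis = tune.run(
    trainable,
    metric='score',
    mode='max',
    config=space,
    num_samples=2,   # 2 repetitions of the 2-point grid -> 4 trials
    scheduler=tune_scheduler,
)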

Second, early termination with ASHA always uses the latest available result. This means that if your trials report 10 and 3 after the 5th step (which is when ASHA makes its first evaluation), ASHA will stop the trials with a=1 (which reported 3) early.
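To make the mechanism concrete, here is a rough sketch (not Ray's actual implementation) of the decision an ASHA rung makes, using only the latest reported score of each trial that has reached the rung:

# Illustrative sketch of an ASHA rung decision for mode="max":
# keep only the top 1/reduction_factor trials by their latest reported score.
def continues_past_rung(latest_scores, my_score, reduction_factor=2):
    ranked = sorted(latest_scores, reverse=True)
    keep = max(1, len(ranked) // reduction_factor)
    cutoff = ranked[keep - 1]
    return my_score >= cutoff

# With latest scores [10, 3] at the grace-period rung and reduction_factor=2:
print(continues_past_rung([10, 3], 10))  # True  -> the a=0 trial continues
print(continues_past_rung([10, 3], 3))   # False -> the a=1 trial is stopped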

Third, in experiment analysis the best_result property means “take the trial with the best last score and return its result”. So this is a very simple lookup, and it always returns the last reported score.

This also means that in the example you get a=1 and max_score=3 because the trials that continue to run (a=0) eventually report a score of 1, which is lower than 3. Tune does not take the number of iterations into account here, or the fact that the trial was early stopped.
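In other words, the default behavior should correspond to scope="last" in get_best_trial(); something like this reproduces what best_result picks in your run:

best_trial_last = analysis.get_best_trial("score", "max", scope="last")
print(best_trial_last.last_result["score"])  # 3 here, from an a=1 trial that was stopped early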

However, you can use other experiment analysis methods to analyze the results differently. For instance, you can specify the scope in get_best_trial() to get the best ever observed result (instead of the last):

best_trial_in_run = analysis.get_best_trial("score", "max", scope="all")
print(f"Best result ever was achieved by trial {best_trial_in_run}")

And you can also get a dataframe with the best observed results for further analysis:

df = analysis.dataframe("score", "max")
print(df)

With the results:

 best score was 3 with a = 1
Best result ever was achieved by trial trainable_aefb5_00000

   score  ...                                             logdir
0   10.0  ...  /Users/kai/ray_results/trainable_2022-03-25_13...
1    4.0  ...  /Users/kai/ray_results/trainable_2022-03-25_13...
2   10.0  ...  /Users/kai/ray_results/trainable_2022-03-25_13...
3    4.0  ...  /Users/kai/ray_results/trainable_2022-03-25_13...

Ok, I was misunderstanding grace_period=5 – I thought it meant “terminate if there has been no improvement over the last 5 steps”, but instead it can be thought of as a min_t, analogous to max_t, i.e. run a trial for at least grace_period steps and at most max_t steps.

So the a=1 trial was killed after 5 steps because the score declined from 4 to 3. But why was the a=0 trial continued much further than 5 steps?

@kai I notice you said “when they do the first evaluation” – I didn’t understand that. I thought the trials are “evaluated” each time a score is reported back, so in this example they are “evaluated” at every step, right? I guess I actually don’t even understand why the a=1 trial is stopped after step 5. Whatever logic I think up to explain this seems to contradict the behavior of the a=0 trial.

This is very helpful to know, thanks. However, what if I want the search + scheduler algorithms to use a different notion of “trial quality” than the last result, for example the best-ever result? Absent this, I would worry that a config that is good by best-ever result may be lost due to early termination based on the last result.

I suppose one way around this is to ensure that I take control of what is being reported to the TuneReportCallback, e.g. in the PTL validation_epoch_end I could track val_auc_best and specify this metric as the one to optimize when I configure Ray/Tune. I’ve ended up doing this in fact.
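For concreteness, here is a sketch of that idea applied to the toy trainable above (score_best is just an illustrative name for the running-best metric):

def trainable_running_best(config):
    best_so_far = float("-inf")
    for x in range(n):
        score = objective(x, config["a"])
        best_so_far = max(best_so_far, score)
        # Report both; point Tune/ASHA at score_best so that stopping and ranking
        # decisions are based on "best seen so far" rather than the last value.
        tune.report(score=score, score_best=best_so_far)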

To the grace period question: yes, grace period 5 means a “min t” of 5, so each trial runs for at least 5 iterations before pruning can happen.

Then at iteration 5, Tune sees metric = 10 for a=0 and metric = 3 for a=1, so it stops the a=1 trials. Does this make sense?
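If it helps, the rung milestones can be roughly reconstructed as grace_period * reduction_factor**k, capped at max_t (this is an approximation of how ASHA spaces its decisions, not its exact internals):

grace_period, max_t, reduction_factor = 5, 10, 2
rungs, t = [], grace_period
while t <= max_t:
    rungs.append(t)
    t *= reduction_factor
print(rungs)  # [5, 10]; the trainable above reports only 9 times, so the only
              # stop/continue decision that actually happens is at iteration 5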

Your workaround sounds like it could work. Note you can specify a separate metric for ASHAScheduler (e.g. val_auc_best) if you want to use a search algorithm with a different metric.
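A hedged sketch of that idea (score_best stands for whatever running-best metric the trainable reports; the exact rules about where metric/mode may be passed have varied between Ray versions):

scheduler_best = ASHAScheduler(
    metric='score_best',    # ASHA prunes on the running-best value
    mode='max',
    max_t=10,
    grace_period=5,
    reduction_factor=2,
)
# A search algorithm (or the final analysis, e.g.
# analysis.get_best_trial('score', 'max', scope='all')) can then rank trials on a
# different metric. Some Ray versions complain if metric/mode are passed both to
# tune.run() and to the scheduler, so here they are set only on the scheduler.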

Also, you should probably consider keeping an n-step average rather than the “best” metric. If you coincidentally get a very good result once at the start of training, it would otherwise taint your results. Generally, you should expect earlier results to have higher variance and later results to be a bit more stable (because the model has processed much more training data) - that’s also why most algorithms just take the last result into account.
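A minimal sketch of that suggestion for the toy trainable, using an arbitrary window size of 3:

from collections import deque

def trainable_smoothed(config, window=3):
    # Report an n-step moving average instead of the raw score, so a single
    # lucky early spike does not dominate pruning and ranking decisions.
    recent = deque(maxlen=window)
    for x in range(n):
        recent.append(objective(x, config["a"]))
        tune.report(score_avg=sum(recent) / len(recent))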

Yes it does, thanks @kai

This is a good point. I am in fact training concurrently on k random train-val splits (as in k-fold cross-validation) and computing the average val_auc across these k folds, so the effect of early spikes in the metric should be muted. I am also experimenting with a fancier metric like val_quality = mean(val_auc) - lambda * std(val_auc), where the mean and std are taken over the k fold metrics. This metric favors trials with a high metric and low variance across folds.
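For concreteness, the fold-aggregated metric described above could be computed along these lines (names and the lambda value are illustrative):

import numpy as np

def fold_quality(fold_aucs, lam=1.0):
    # Reward a high mean val_auc across the k folds, penalize variance across folds.
    fold_aucs = np.asarray(fold_aucs, dtype=float)
    return fold_aucs.mean() - lam * fold_aucs.std()

print(fold_quality([0.82, 0.80, 0.84]))  # high mean, low spread -> favored
print(fold_quality([0.90, 0.70, 0.86]))  # similar mean, more spread -> penalized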