Hyperopt based on best result instead of last result

Hi! I am using Ray Tune with AxSearch to perform hyperparameter optimization on my model training, and I was wondering whether Tune should internally report the best or the last trial result to Ax.

I tried to dive deep into the Ray Tune code, and what I found is that the TrialRunner lets every scheduler and search algorithm make use of the result of every training step; however, when the on_trial_complete callback is called, the last result is passed as the result parameter. Since AxSearch only implements the on_trial_complete callback, I think it gets the last result instead of the best one, which, I suppose, misleads the BO process in the background.
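For reference, this is roughly the interface I was looking at – a simplified sketch assuming the Ray Tune 1.x Searcher class (ray.tune.suggest.Searcher), with MySearcher just an illustrative subclass:

```python
# Simplified sketch of the Searcher hooks, assuming the Ray Tune 1.x interface.
from ray.tune.suggest import Searcher


class MySearcher(Searcher):  # illustrative name, not part of Ray Tune
    def suggest(self, trial_id):
        # Would normally return the next hyperparameter configuration.
        return {}

    def on_trial_result(self, trial_id, result):
        # Called after every tune.report() / training step.
        # AxSearch does not override this hook.
        pass

    def on_trial_complete(self, trial_id, result=None, error=False):
        # Called once per trial, with the *last* reported result dict;
        # AxSearch forwards this single result to the Ax client.
        pass
```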

I have also found this topic which seems similar to my question: Best model based on Checkpoint not Last epoch

Can someone tell me whether my logic and findings are right, or have I maybe missed something?

Hi @timurlenk07,
your findings are spot on! Ax indeed only uses the last obtained result.

I think this is usually not a problem, however. A BO process usually optimizes only a single objective, so we can only report one result back per trial. Intermediate results can thus only be used for early stopping decisions and the like.

In the case where you actually have multiple results per trial, you would expect the trial to converge over more iterations. And this is reasonable to assume: after more training, the result should be more accurate, more reliable, and less prone to effects of random parameter initialization or data distributions.

What I’m trying to say is: if you train a model that performs well after a few iterations but then converges to a suboptimal solution, was the configuration really good, or was it probably just an effect of random initialization or a favourable data distribution? I’d argue that you are usually interested in the final outcome.

That said, you are free to report whichever value you want in your trainable function. So if you would like to use the best result, just keep track of it in your trainable and make sure the last call to tune.report() includes it.
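A minimal sketch of what I mean (train_one_epoch and evaluate are placeholders for your own training and evaluation code, and I’m using an f1 metric just as an example):

```python
from ray import tune


def trainable(config):
    best_f1 = float("-inf")
    for epoch in range(config["max_epochs"]):
        train_one_epoch(config)   # placeholder: your training step
        f1 = evaluate(config)     # placeholder: your evaluation step
        best_f1 = max(best_f1, f1)
        # Always report the best value seen so far, so the last reported
        # value (the one AxSearch receives on completion) is also the best.
        tune.report(f1=best_f1)
```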

Thanks for your response @kai!

You have suggested that users should either report results only if they are better than any previous one, or re-report the best result at the end of a trial. I feel both of these would only be workarounds for the actual problem, namely that we want to return to the BO library the best result with respect to the current optimization objective.

You have mentioned that intermediate results are only usable for early stopping decisions and the like, but if we stop a trial once we see no improvement in evaluation, then by definition the last result is worse than the best one. In this scenario (using EarlyStopping to stop the training of an NN), the only way to avoid messing up the BO process is to do as you suggest: either report infrequently or re-report the best result at the end. My problem with these solutions is that they would mess up the logged values, and if we change the optimization objective in tune.run (for instance, we aim for higher precision instead of the F1 score, or we want to minimize the loss), we also have to change the way we report our metrics in our Trainable.

To sum it up, I feel that what we are talking about can only be a workaround, since it is the Ray Tune library’s responsibility to connect the reported metrics to the chosen optimization library while taking the given optimization objective into account. Either that, or the documentation should be much clearer about when, what, and how to report for each optimization library.

What do you think about this?

In my opinion this is very specific to the problem you are optimizing, and I don’t think the BO should try to predict the best seen result. Take the case of RL with epsilon decay, i.e. you take random actions at the start and slowly decrease the amount of randomness. In this case, policies that converge to a very bad state would still show some acceptable performance at the start of training (because of lucky random actions). In fact, here the BO would not be able to distinguish between “somewhat bad” and “really, really bad” policies, so we would lose information.

I agree that re-reporting results at the end is a workaround that comes with some problems for logging etc. Generally, I’d say adding an option to the Ax searcher to track the best result and optionally use it instead of the last result might be a good idea to support a broad range of use cases. Would you be willing to try to contribute such an extension? I’m very happy to help and assist with this!
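One possible shape for this (just a sketch from my side – BestResultAxSearch and the way it caches results are not existing Ray Tune code) would be a thin wrapper around AxSearch that records the best intermediate result per trial in on_trial_result and hands that one to Ax on completion:

```python
from ray.tune.suggest.ax import AxSearch


class BestResultAxSearch(AxSearch):  # illustrative name, not part of Ray Tune
    def __init__(self, metric, mode="max", **kwargs):
        super().__init__(metric=metric, mode=mode, **kwargs)
        self._opt_metric = metric
        self._opt_mode = mode
        self._best_results = {}  # trial_id -> best result dict seen so far

    def on_trial_result(self, trial_id, result):
        best = self._best_results.get(trial_id)
        better = (
            best is None
            or (self._opt_mode == "max"
                and result[self._opt_metric] > best[self._opt_metric])
            or (self._opt_mode == "min"
                and result[self._opt_metric] < best[self._opt_metric])
        )
        if better:
            self._best_results[trial_id] = result

    def on_trial_complete(self, trial_id, result=None, error=False):
        # Use the best cached result; fall back to the last result
        # (e.g. on errors or if no intermediate result was seen).
        result = self._best_results.pop(trial_id, result)
        super().on_trial_complete(trial_id, result=result, error=error)
```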

By default we should stick with the last result, but if we add support for the best result, we should definitely expand on this in the docs.

Now I see what could be a problem with this (though I would argue that it is a rather specific situation in which performance continuously worsens during training, even with really bad parameter choices).

I feel like implementing it only in the Ax integration would miss the more general point, namely that we often want to consider the result of a trial to be the best metric achieved during its runtime. What I could imagine as a solution is a parameter of tune.run, much like the scope argument of ExperimentAnalysis.get_best_trial(), where we could decide what we consider the result of each trial, which in turn gets passed to the on_trial_complete callback (probably as a new, fourth parameter). I know this would be a breaking change for all current implementations of this callback, so while I am fully aware of this and its complications, what would you think of this idea of mine? Or do you see some other alternative to modifying only the Ax integration’s implementation?
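Just to illustrate what I have in mind (purely hypothetical – neither the trial_result_scope argument nor the extra callback parameter exist in Ray Tune):

```python
# Hypothetical API shape; `trial_result_scope` and `best_result` do NOT exist.
from ray import tune
from ray.tune.suggest.ax import AxSearch

analysis = tune.run(
    trainable,                    # your trainable function
    metric="f1",
    mode="max",
    search_alg=AxSearch(),
    # proposed: analogous to ExperimentAnalysis.get_best_trial(scope=...)
    trial_result_scope="best",    # "last" (current behaviour) or "best"
)

# The searcher callback would then receive the selected result as well, e.g.:
# def on_trial_complete(self, trial_id, result=None, error=False,
#                       best_result=None): ...
```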

You may find this relevant: Optimising with respect to the epoch that scored highest for a trial (instead of the last epoch) · Issue #81 · ray-project/ray_lightning · GitHub