If my trainable has a bug, Tune will still run every trial, and all of them will end in ERROR.
I would like to stop the experiment once the number of ERROR trials exceeds a maximum allowed count.
tune.run has a fail_fast parameter, but it stops the experiment on the first error, which is not what I want.
Could you please suggest how to do it?
Hey @metaphor, could you explain the use-case a little more? Would it be reasonable to try running your Tune job with a smaller number of samples first to verify that there are no bugs before proceeding with a full sweep?
It’s a production AutoML platform based on Ray. We can’t know about ‘bugs’ in advance, since user inputs vary.
It could run 1000+ trials in some cases. We just want a way to stop the experiment when trials keep ending in ERROR, so the user doesn’t have to wait too long.
Gotcha, that makes sense!
I took a look and saw that our current Stopper API doesn’t quite have access to this information. I’ve created a GitHub issue to track this here: [Feature][Tune] Trial status based Stopper · Issue #21222 · ray-project/ray · GitHub
Thanks, I will follow it.
Meanwhile, a workaround could be to wrap the Searcher (much like ConcurrencyLimiter does). The Searcher has a lifecycle callback, on_trial_complete(error: bool = False), so the wrapper can count errors there and stop suggesting new trials by returning Searcher.FINISHED once the tolerance is exceeded.
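For anyone who wants to try this, here is a rough sketch of such a wrapper. It is not an official Ray API: the class name ErrorLimitSearcher and the max_errors argument are made up for illustration, and the import path for Searcher depends on the Ray version (a small stand-in class is used if Ray is not installed, so the sketch can be read and run on its own). It counts all errored trials; resetting the counter on a successful trial would track consecutive errors instead.

```python
try:
    from ray.tune.search import Searcher  # newer Ray releases
except ImportError:
    try:
        from ray.tune.suggest import Searcher  # older Ray releases
    except ImportError:
        class Searcher:  # stand-in so the sketch runs without Ray
            FINISHED = "FINISHED"

            def __init__(self, metric=None, mode=None):
                self._metric = metric
                self._mode = mode


class ErrorLimitSearcher(Searcher):
    """Hypothetical wrapper: stops suggesting after too many errored trials."""

    def __init__(self, searcher, max_errors):
        self.searcher = searcher      # the wrapped (real) searcher
        self.max_errors = max_errors  # tolerated number of ERROR trials
        self.num_errors = 0
        super().__init__(
            metric=getattr(searcher, "metric", None),
            mode=getattr(searcher, "mode", None),
        )

    def suggest(self, trial_id):
        # Once the tolerance is exceeded, tell Tune there is nothing left to run.
        if self.num_errors > self.max_errors:
            return Searcher.FINISHED
        return self.searcher.suggest(trial_id)

    def on_trial_complete(self, trial_id, result=None, error=False):
        if error:
            self.num_errors += 1
        # Always forward the event so the wrapped searcher stays in sync.
        self.searcher.on_trial_complete(trial_id, result=result, error=error)

    def on_trial_result(self, trial_id, result):
        self.searcher.on_trial_result(trial_id, result)

    def set_search_properties(self, metric, mode, config, **spec):
        return self.searcher.set_search_properties(metric, mode, config, **spec)
```

The wrapper would then be passed wherever the original searcher was, e.g. as the search_alg argument of tune.run. Trials that are already running when the limit is hit will still finish; the wrapper only prevents new ones from being suggested.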