[tune] Using an experiment-wide Stopper sometimes terminates prematurely

Hello, I have been using Tune (1.2.0dev) without any problems, but recently I ran into an issue with experiment-wide stoppers such as `ExperimentPlateauStopper` and my own custom stopper. Sometimes the experiment exits before it is supposed to (the stopping condition is clearly not met), and when this happens it prints the following warning: `Skipping cleanup - trainable.stop did not return in time. Consider making stop a faster operation.` Does anyone know what causes this and/or how to fix it? Thanks!

A bit more context: each trial trains a neural network with PyTorch, and `tune.report` is called only once, at the end of each training run. The problem happens with both `BayesOptSearch` and `HyperOptSearch`. I also use a `ConcurrencyLimiter`.
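For context, a custom experiment-wide stopper in Tune implements `__call__(trial_id, result)` (per-trial decision) and `stop_all()` (experiment-wide decision). The sketch below is not from my actual code base; it is a minimal, hypothetical example of that shape. In a real script it would subclass `ray.tune.Stopper` and be passed to `tune.run(..., stop=...)`; the metric name `val_loss`, the `patience` window, and the `eps` threshold are all illustrative, not values I actually use.

```python
# Minimal sketch of a custom experiment-wide stopper (hypothetical example,
# not the actual code from this report). In Ray Tune this would subclass
# ray.tune.Stopper; it is written standalone here so the logic is easy to
# inspect.
class LossPlateauStopper:
    """Stop the whole experiment once the best reported val_loss has not
    improved by more than `eps` for `patience` consecutive reports."""

    def __init__(self, patience=5, eps=1e-3):
        self.patience = patience
        self.eps = eps
        self.best = float("inf")
        self.stale = 0  # consecutive reports without improvement

    def __call__(self, trial_id, result):
        """Called for every reported result; returning True would stop
        only that trial, so this stopper always returns False here."""
        loss = result["val_loss"]
        if loss < self.best - self.eps:
            self.best = loss
            self.stale = 0
        else:
            self.stale += 1
        return False

    def stop_all(self):
        """Checked after each result; returning True ends the experiment."""
        return self.stale >= self.patience
```

Since `tune.report` is called only once per trial here, each trial contributes a single result to this window, so the plateau is measured across trials rather than across epochs.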

Hey @chawins thanks for making this issue!

Could you post a full reproducible script for us to take a look at?


Thanks, @rliaw. The whole thing is a pretty large code base, but I will see what I can do.
I’m also re-running the same experiment on version 1.1.0 to see if the problem persists.