Hello, I have been using Tune (1.2.0dev) without any problems, but recently I ran into issues with experiment-wide stoppers such as ExperimentPlateauStopper and my own custom stopper. Sometimes the experiment exits before it is supposed to (the stopping condition was clearly not met), and when this happens it outputs the following warning: Skipping cleanup - trainable.stop did not return in time. Consider making stop a faster operation. Does anyone know what causes this and/or how to fix it? Thanks!
A bit more context: each trial trains a neural network using PyTorch, and tune.report is called only once, at the end of training. This problem happens with both BayesOptSearch and HyperOptSearch. I also use ConcurrencyLimiter.
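For anyone following along, here is a minimal sketch of what an experiment-wide custom stopper can look like. It is not the stopper from my code base: the class name, the "score" metric key, and the patience logic are all made up for illustration. It only assumes the Stopper interface from ray.tune (a `__call__(trial_id, result)` method plus `stop_all()`), and falls back to a plain object base class so the snippet runs even without Ray installed.

```python
try:
    from ray.tune import Stopper  # real base class when Ray is installed
except ImportError:
    Stopper = object  # fallback so this sketch still runs without Ray


class BestScorePlateauStopper(Stopper):
    """Hypothetical experiment-wide stopper.

    Stops the whole experiment once the best reported score has not
    improved for `patience` consecutive results. Since each trial here
    reports only once, one result corresponds to one finished trial.
    """

    def __init__(self, metric="score", patience=3):
        self._metric = metric      # assumed key in the reported result dict
        self._patience = patience
        self._best = float("-inf")
        self._stale = 0            # results seen since the last improvement

    def __call__(self, trial_id, result):
        value = result[self._metric]
        if value > self._best:
            self._best = value
            self._stale = 0
        else:
            self._stale += 1
        return False  # never stop an individual trial from here

    def stop_all(self):
        # True ends the whole experiment, not just one trial.
        return self._stale >= self._patience
```

An instance of such a class would be passed to Tune through the `stop` argument of `tune.run`, e.g. `stop=BestScorePlateauStopper(patience=3)`.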
Thanks, @rliaw. The whole thing is a pretty large code base, but I will see what I can do.
I’m also running the stopper on version 1.1.0 to see if the problem persists.
@justinvyu There is no simple script to share, and I have already set TUNE_FORCE_TRIAL_CLEANUP_S=1 as a temporary fix, which forces cleanup by terminating the actors.
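In case it helps someone else, this is roughly how the workaround can be applied from Python instead of the shell. It is only a sketch: it assumes the variable is read from the driver process environment, so it has to be set before Tune starts, and the value 1 (seconds) is simply the one I used.

```python
import os

# Workaround from this thread: force trial cleanup after a 1 second grace
# period. Assumption: this must run before Tune is started so that the
# setting is visible to the driver process.
os.environ["TUNE_FORCE_TRIAL_CLEANUP_S"] = "1"
```

Note that this terminates the trial actors forcibly, so any cleanup logic that takes longer than the grace period will not finish.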
@justinvyu, yes, it repeatedly outputs 2023-04-13 09:08:08,022 WARNING ray_trial_executor.py:146 -- Skipping cleanup - trainable.stop did not return in time. Consider making stop a faster operation. The Ray version is 1.8.0.
Could you try upgrading to the latest Ray version? It’s hard to give a fix for the old version, and this error message is no longer emitted by the current version of Ray.