[tune] Using an experiment-wide Stopper sometimes terminates prematurely

Hello, I have been using Tune (1.2.0dev) without any problems, but recently I have run into issues with experiment-wide stoppers such as ExperimentPlateauStopper and my own custom stopper. Sometimes the experiment exits before it is supposed to (the stopping condition is clearly not met), and when this happens it outputs the following warning: Skipping cleanup - trainable.stop did not return in time. Consider making stop a faster operation. Does anyone know what causes this and/or how to fix it? Thanks!
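For reference, my custom stopper is an experiment-wide one built on ray.tune.Stopper, roughly along these lines (a minimal sketch; the metric name and threshold here are placeholders, not my actual stopping condition):

```python
from ray.tune import Stopper

class LossThresholdStopper(Stopper):
    """Stops the whole experiment once any trial reports a low enough loss."""

    def __init__(self, metric="val_loss", threshold=0.05):
        self._metric = metric
        self._threshold = threshold
        self._should_stop = False

    def __call__(self, trial_id, result):
        # Called for every reported result; individual trials are never stopped early here.
        if result.get(self._metric, float("inf")) < self._threshold:
            self._should_stop = True
        return False

    def stop_all(self):
        # Returning True here ends the entire experiment.
        return self._should_stop
```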

A bit more context: each trial trains a neural network using PyTorch, and tune.report is called only once, at the end of training. The problem happens with both BayesOptSearch and HyperOptSearch, and I also use a ConcurrencyLimiter.
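Simplified, the setup looks roughly like this (Ray 1.x import paths; train_fn, val_loss, and the search space are placeholders rather than my actual code):

```python
from ray import tune
from ray.tune.suggest import ConcurrencyLimiter
from ray.tune.suggest.hyperopt import HyperOptSearch
from ray.tune.stopper import ExperimentPlateauStopper

def train_fn(config):
    # ... train a PyTorch model using config["lr"] ...
    final_val_loss = 0.1  # placeholder for the real validation loss
    tune.report(val_loss=final_val_loss)  # reported once, at the end of training

search_alg = ConcurrencyLimiter(
    HyperOptSearch(metric="val_loss", mode="min"),
    max_concurrent=4,
)

analysis = tune.run(
    train_fn,
    config={"lr": tune.loguniform(1e-4, 1e-1)},
    num_samples=50,
    search_alg=search_alg,
    # Experiment-wide stopper: end the whole experiment once val_loss plateaus.
    stop=ExperimentPlateauStopper(metric="val_loss", mode="min", std=0.001, top=10, patience=5),
)
```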

Hey @chawins, thanks for making this issue!

Could you post a full reproducible script for us to take a look at?

Thanks!

Thanks, @rliaw. The whole thing is a pretty large code base, but I will see what I can do.
I’m also running the stopper on version 1.1.0 to see if the problem persists.

Hello @chawins, I have also encountered this problem. Have you fixed it? Thanks.

Hi @zmin1217,

Do you have a simple repro script? If so, could you create a GitHub issue with that script attached?

@justinvyu There is no simple script, and I have already set TUNE_FORCE_TRIAL_CLEANUP_S=1 as a temporary fix, which forces cleanup by terminating the actors.
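For reference, one way to apply that workaround is to set the variable in the environment before Tune starts (a sketch; it assumes force-killing the trainable actors after one second is acceptable):

```python
import os

# Must be set before Tune starts so the trial executor picks it up.
os.environ["TUNE_FORCE_TRIAL_CLEANUP_S"] = "1"  # force cleanup after 1 second

from ray import tune
# ... tune.run(...) as usual ...
```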

Hi @zmin1217,

What’s the exact error you are seeing? Is it this:

Skipping cleanup - trainable.stop did not return in time.

What version of Ray are you on?

@justinvyu, yes, it repeatedly outputs
2023-04-13 09:08:08,022 WARNING ray_trial_executor.py:146 -- Skipping cleanup - trainable.stop did not return in time. Consider making stop a faster operation.
The Ray version is 1.8.0.

Could you try upgrading to the latest Ray version? It's hard to provide a fix for such an old version, and this error message is no longer emitted by current versions of Ray.