Stop experiment, but finish currently running trials

Question: Using a Stopper whose stop_all method returns True, as described in the docs, stops all currently running trials and does not queue any new ones, i.e., it stops the experiment. How can I stop the experiment but still let the currently running trials finish?

Use case 1:

I monitor my tune experiments live. Sometimes this makes me realize that my tune config isn’t ideal (for example, the range of one of the hyperparameters is too limited, or I think of something else). So, I want to change my tune settings. However, I don’t want to kill/stop trials that are already running, for two reasons:

  1. The trials might have been running for more than an hour, so I would waste a lot of time if I just killed them.
  2. If I kill them before they stop because of a plateau (or a similar trial-level stopper), then the performance associated with their hyperparameters will be misleading in any analysis of the hyperparameter space. I would then have to clean up all non-completed trials manually.

Use case 2:

This feature would also allow me to devise a workaround for issues with time limitations of ray workers that are submitted to a batch system (see this related question of mine).

Possible solutions:

  • Perfect solution: An option on the ray dashboard or a command to connect to the ray head that triggers this kind of “soft” stopping of the experiment
  • Extending the stopper class: If I could write a Stopper with a soft_stop_all method or a finish_trials_upon_experiment_stop parameter/attribute, I could easily build something on my own (for example, trivially by checking if a certain file exists and then performing a soft stop).
  • Overwriting n_trials: Could I change n_trials during the experiment? Then I could just set it to the exact number of trials that have already run/are running to have this effect.
  • Make workers not accept jobs: Another hacky way would be to somehow configure all workers not to accept new jobs. Then the main tune script could not enqueue any new trials, and after the current ones have finished, I could safely kill and restart it.

Moved this to Ray Air, Ray Tune because your requirements are targeted at these APIs.

Can you write a customized scheduler for that?
As you correctly pointed out, I don’t think the Stopper class supports this out of the box.

Thank you @arturn @xwjiang2010!

@xwjiang2010: That sounds like a clean solution. I’m currently looking at the TrialScheduler class. The relevant method here would be choose_trial_to_run, and we’d make it return None. However, I don’t think this alone would make Ray return automatically, right? It would probably get stuck in a loop, waiting for the remaining trials to become available.

At least I don’t immediately see any abort mechanics in TrialRunner.step, which calls choose_trial_to_run via TrialRunner._update_trial_queue_and_get_next_trial().

Implementing my new behavior in the TrialRunner class itself would be relatively easy, but that class doesn’t seem to be easily exposed for subclassing.

However, I could get a hold of the instance, because choose_trial_to_run is called with the TrialRunner as an argument, so I could do something like this as a workaround:

# Assumed imports (Ray 2.x paths); subclassing FIFOScheduler so that
# super().choose_trial_to_run() has a concrete implementation to fall back on.
from ray.tune.experiment import Trial
from ray.tune.schedulers import FIFOScheduler


class SoftStopEnabledTrialScheduler(FIFOScheduler):
    def need_soft_stop(self):
        # User-defined trigger, e.g., check whether a sentinel file exists.
        ...

    def choose_trial_to_run(self, trial_runner: "trial_runner.TrialRunner"):
        if self.need_soft_stop():
            # This ensures that we don't enqueue additional trials
            trial_runner._search_alg.set_finished()
            # If we enqueued any other trials, let's remove them
            for trial in trial_runner._trials:
                if trial.status == Trial.PENDING:
                    trial_runner.trial_executor.stop_trial(trial)
            return None
        return super().choose_trial_to_run(trial_runner)

As you can see, it needs quite a few private members of the TrialRunner, so this is only a hack.
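
For completeness, the scheduler from the sketch above could then be plugged in like any other scheduler; the trainable and search space here are placeholders:

from ray import tune


def my_trainable(config):  # placeholder trainable
    tune.report(score=config["x"])


analysis = tune.run(
    my_trainable,
    config={"x": tune.uniform(0.0, 1.0)},
    num_samples=100,
    scheduler=SoftStopEnabledTrialScheduler(),
)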

Ideally, I would like to develop something that can be added to the ray SLURM guide (I don’t think it works for all workflows yet, because of issues like this one).

Looking at the source, it would be very easy to implement this as a feature of the Stopper class instead, adding a new soft_stop_experiment method and having TrialRunner check it.

The modifications required are small (less than 10 LoC), and adding the method to Stopper would be backward compatible.
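
Concretely, the backward-compatible part of that change could look something like this (a sketch of the proposal, not existing Ray code):

class Stopper:
    # ... existing interface, abridged ...

    def __call__(self, trial_id, result):
        raise NotImplementedError

    def stop_all(self):
        raise NotImplementedError

    def soft_stop_experiment(self):
        # New hook: return True to stop enqueuing new trials while letting
        # the currently running ones finish. Defaulting to False keeps all
        # existing Stopper subclasses behaving exactly as before.
        return False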

Do you think a corresponding PR would be accepted? @arturn @xwjiang2010

Ah, got you!
Thanks for giving it so much thought.
So I wonder: is the logic to determine when it’s time to “soft stop” easy to implement? Or would you rather manually check some dashboard and then trigger the “soft stop” behavior?

I do notice that what you listed as the “perfect solution” is something that would be manually triggered by you?

cc @Huaiwei_Sun

Thank you for reading through all this, @xwjiang2010 🙂

An easy first step would be to add a soft_stop_experiment method to Stopper that just returns False, leaving it open to users to fill in that logic by subclassing (the important change would just be that TrialRunner checks the new method).

In the “manual stop” use case outlined above, I could then just cook up something trivial for now (e.g., check whether a ~/SOFT_STOP_EXPERIMENT file exists), as sketched below.
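
To make that concrete, here is a minimal sketch, assuming the proposed soft_stop_experiment hook existed (it does not in current Ray; the hook and the sentinel-file path are part of this proposal):

import os

from ray.tune.stopper import Stopper


class ManualSoftStopper(Stopper):
    def __call__(self, trial_id, result):
        return False  # never stop individual trials early

    def stop_all(self):
        return False  # never hard-stop the experiment

    def soft_stop_experiment(self):
        # Proposed hook (not in Ray yet): once the sentinel file appears,
        # stop enqueuing new trials but let running ones finish.
        return os.path.exists(os.path.expanduser("~/SOFT_STOP_EXPERIMENT"))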

But I think many people would simply prefer a “soft stop” over a “hard stop” for many stoppers. For example, a version of TimeoutStopper that just stops enqueuing new trials but finishes the ones that are currently running.
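
Under the same assumption about the hook, such a soft variant of TimeoutStopper could look roughly like this:

import time

from ray.tune.stopper import Stopper


class SoftTimeoutStopper(Stopper):
    def __init__(self, timeout_s: float):
        self._deadline = time.monotonic() + timeout_s

    def __call__(self, trial_id, result):
        return False  # don't stop individual trials

    def stop_all(self):
        return False  # don't hard-stop the experiment

    def soft_stop_experiment(self):
        # Proposed hook: after the timeout, stop enqueuing new trials
        # but let the currently running ones finish.
        return time.monotonic() > self._deadline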

I’m happy to work on this if this sounds promising.

Actually, thinking about it more, would it be possible for you to implement a customized Searcher (instead of a customized Scheduler) to achieve that?

See ray/tune/search/search_generator.py::create_trial_if_possible/next_trial

Basically, can you override the searcher’s suggest method to return Searcher.FINISHED when certain criteria are met?

Just trying to see if what you asked for can already be met with the current API. It seems so.
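
To illustrate, here is a rough sketch of that suggestion against the current API; the wrapped base searcher and the sentinel-file trigger are assumptions for demonstration, not part of Ray:

import os

from ray.tune.search import Searcher


class SoftStopSearcher(Searcher):
    """Delegates to a wrapped searcher, but reports Searcher.FINISHED once a
    sentinel file appears, so no new trials are created while the running
    ones are allowed to finish."""

    def __init__(self, searcher: Searcher):
        super().__init__()
        self._searcher = searcher

    def set_search_properties(self, metric, mode, config, **spec):
        return self._searcher.set_search_properties(metric, mode, config, **spec)

    def suggest(self, trial_id):
        if os.path.exists(os.path.expanduser("~/SOFT_STOP_EXPERIMENT")):
            return Searcher.FINISHED  # SearchGenerator stops creating trials
        return self._searcher.suggest(trial_id)

    def on_trial_result(self, trial_id, result):
        self._searcher.on_trial_result(trial_id, result)

    def on_trial_complete(self, trial_id, result=None, error=False):
        self._searcher.on_trial_complete(trial_id, result=result, error=error)

Such a wrapper could then be passed to tune.run via search_alg=SoftStopSearcher(base_searcher).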