Saving and restoring a trial state with TrialScheduler

I am trying to write a custom scheduler that pauses trials at some point and resumes them later. Currently, the scheduler reruns paused trials from the beginning instead of continuing where they left off. How can I implement this pause-and-resume behaviour? The example for the TrialScheduler class does not show how to use the save/restore functions.

Simplified, my TrialScheduler subclass looks like this:

def choose_trial_to_run(self, tune_controller: "TuneController") -> Optional[Trial]:
    for trial in tune_controller.get_trials():
        if trial.status == Trial.PAUSED and condition:
            # restore and return trial
            return trial
    # no paused trial is ready to resume
    return None

def on_trial_result(self, tune_controller: "TuneController", trial: Trial, result: Dict) -> str:
    if some_other_condition:
        # save trial
        return TrialScheduler.PAUSE
    # otherwise let the trial keep running
    return TrialScheduler.CONTINUE

Hi @caesar025,

All you need to do is implement checkpoint saving and restoring in your training function. That way, when a trial “unpauses”, it loads the latest saved checkpoint and restores the training state from it. Make sure you checkpoint at least as often as you pause. Otherwise a trial can be paused before it has saved anything, and it will keep restarting from scratch.

See here:

https://docs.ray.io/en/releases-2.7.1/train/user-guides/checkpoints.html#saving-checkpoints-during-training
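
A minimal sketch of the pattern described above, assuming the Ray 2.7 function-trainable API (train.get_checkpoint, train.report and Checkpoint.from_directory). The helpers save_state and load_state, the state.json file name, the num_steps config key and the placeholder loss are made up for illustration and are not part of Ray:

```python
import json
import os
import tempfile


def save_state(directory, state):
    """Write the training state (a JSON-serializable dict) into `directory`."""
    with open(os.path.join(directory, "state.json"), "w") as f:
        json.dump(state, f)


def load_state(directory):
    """Read back the training state written by `save_state`."""
    with open(os.path.join(directory, "state.json")) as f:
        return json.load(f)


def train_fn(config):
    # Imported here so the helpers above can be used without Ray installed.
    from ray import train
    from ray.train import Checkpoint

    state = {"step": 0}

    # If the trial was paused earlier, Tune hands back its latest checkpoint.
    checkpoint = train.get_checkpoint()
    if checkpoint:
        with checkpoint.as_directory() as checkpoint_dir:
            state = load_state(checkpoint_dir)

    # Resume from the saved step instead of starting at zero.
    for step in range(state["step"], config["num_steps"]):
        loss = 1.0 / (step + 1)  # placeholder for a real training step
        state = {"step": step + 1}

        # Checkpoint on every report, so a pause never loses progress.
        with tempfile.TemporaryDirectory() as tmp_dir:
            save_state(tmp_dir, state)
            train.report({"loss": loss}, checkpoint=Checkpoint.from_directory(tmp_dir))
```

Because every report carries a checkpoint here, the scheduler can return PAUSE after any result and the trial will still pick up at the next step when it is rescheduled. With expensive checkpoints you would save less often, as long as that is still at least as often as the scheduler pauses.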

Ok, so the save/restore functions of the TrialScheduler class have nothing to do with saving and restoring trials?

Yes, save and restore on a trial scheduler refer to saving the state of the scheduler itself, not the state of individual trials.
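
To illustrate the distinction, here is a rough sketch of the kind of state a scheduler's save/restore would cover. It is a plain stand-in class, not a real TrialScheduler subclass. It assumes the scheduler methods take a single checkpoint_path argument, which you should check against your Ray version. The class name, pause_counts and record_pause are hypothetical:

```python
import pickle


class PausingSchedulerState:
    """Stand-in for a custom scheduler; a real one would subclass TrialScheduler."""

    def __init__(self):
        # Scheduler-level bookkeeping, not model weights or optimizer state.
        self.pause_counts = {}

    def record_pause(self, trial_id):
        """Remember how many times a given trial has been paused."""
        self.pause_counts[trial_id] = self.pause_counts.get(trial_id, 0) + 1

    def save(self, checkpoint_path):
        """Persist the scheduler's own attributes to a file."""
        with open(checkpoint_path, "wb") as f:
            pickle.dump(self.__dict__, f)

    def restore(self, checkpoint_path):
        """Load the scheduler's attributes back from a file."""
        with open(checkpoint_path, "rb") as f:
            self.__dict__.update(pickle.load(f))
```

Nothing in this file describes where a trial was in its training loop. That information lives only in the checkpoints the training function reports.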
