Tune resume: fall back to older checkpoints when the latest is corrupted

Currently the resume option for tune.run tries to resume from the last checkpoint recorded in the experiment state (if I understand it correctly).
However, in many cases the last checkpoint is corrupted or doesn't exist, so Tune just gives up and marks the trial as failed.

Would it be possible to back off to an earlier existing checkpoint and try that? E.g. if I have:

checkpoint_000001  checkpoint_000002  checkpoint_000184  checkpoint_000200

and checkpoint_000200 is corrupted, then Tune should try to reload from checkpoint_000184, then from checkpoint_000002, then from checkpoint_000001, and only then give up (or restart).

One issue is that if there's a big gap (e.g. from checkpoint_000184 back to checkpoint_000002), such a large back-off might be undesirable. However, if there's no valid checkpoint, I would rerun that trial anyway, so maybe it's not an issue. The sketch below is roughly what I have in mind.
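To be concrete, here is a minimal sketch of the back-off logic I'm describing. None of this is current Tune API: `trainable.restore(path)` just stands in for however the checkpoint would actually be loaded, and the directory layout is the usual `checkpoint_XXXXXX` folders inside a trial directory.

```python
import os
import re


def restore_with_backoff(trainable, trial_dir):
    """Try to restore `trainable` from the newest checkpoint in `trial_dir`,
    falling back to older checkpoints if a restore fails.

    Assumes `trainable.restore(path)` raises on a corrupted or missing
    checkpoint; the exact restore call depends on the Trainable implementation.
    """
    # Collect checkpoint_XXXXXX directories, newest first.
    pattern = re.compile(r"^checkpoint_(\d+)$")
    checkpoints = sorted(
        (d for d in os.listdir(trial_dir) if pattern.match(d)),
        key=lambda d: int(pattern.match(d).group(1)),
        reverse=True,
    )

    for name in checkpoints:
        path = os.path.join(trial_dir, name)
        try:
            trainable.restore(path)  # may raise if the checkpoint is corrupted
            return path              # restored successfully, stop backing off
        except Exception:
            continue                 # try the next-older checkpoint

    return None  # no valid checkpoint; caller restarts the trial from scratch
```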

cc @matthewdeng for this Tune question

We currently don’t have support for this, but we are refactoring the checkpoint implementation this quarter and can take this into account.

For clarity, would it be helpful to be able to define a maximum staleness of a checkpoint before it is considered “too old”?


Thanks for considering it!

I’m not sure what would be the right logic to handle that. The two scenarios are:

  1. there’s a checkpoint that’s only a few iterations behind the latest → great
  2. the most recent valid checkpoint (before the corrupted latest one) is very old → since the last checkpoint is corrupted, I need to go back no matter what; I can’t use it anyway. So I think there’s no “too old” in this sense, or at least anything will do at that point for me (but other people might have a different preference).

However, this might cause issues with the logs, e.g. TensorBoard or the CSV logger will show a sudden break in the iterations.

Another thing that may be worth looking into is why the checkpoints don’t exist. Is it something you run into often? How big are the checkpoints, and how often are things checkpointed? Also, are you uploading to cloud storage or syncing to the driver node?

Thanks!

I’m logging locally. The checkpoints are a few MB, sometimes a few hundred, depending on the model. I usually checkpoint every iteration or every 5-10 iterations, and keep only the 3 best.
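For reference, my setup is roughly the following. This is only a sketch: `my_trainable` is a placeholder for a class-based Trainable, the metric name is made up, and the exact checkpoint arguments may differ across Ray versions.

```python
from ray import tune

analysis = tune.run(
    my_trainable,                # placeholder for my class-based Trainable
    local_dir="~/ray_results",   # logging locally, no cloud upload
    checkpoint_freq=5,           # checkpoint every 5 iterations (sometimes every 1)
    keep_checkpoints_num=3,      # keep only the 3 best checkpoints
    checkpoint_score_attr="min-mean_loss",  # rank by lowest loss ("min-" prefix); placeholder metric
    checkpoint_at_end=True,
)
```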

Usually this issue happens when I Ctrl+C the run and, in some cases, Tune gets stuck and doesn’t exit cleanly. It doesn’t happen often, and I can avoid killing the run.

One case where it almost always happens is when I don’t wait for the last checkpoint after a Ctrl+C. I rarely do this myself, but when I’m using Slurm or some other cluster manager and my job gets killed, it’s up to the manager to leave enough time for Tune to exit gracefully, which doesn’t always happen. In those cases this back-off would be useful once the training job is resumed.