Currently the resume option for tune.run tries to resume from the last checkpoint that’s recorded in the experiment state (if I understand it correctly).
However, in many cases the last checkpoint is corrupted or might not exist, so tune just gives up and marks the trial as failed.
Would it be possible to back off to a previous existing checkpoint and try that? E.g. if I have checkpoint_000001, checkpoint_000002, checkpoint_000184 and checkpoint_000200, and checkpoint_000200 is corrupted, then Tune should try to reload from checkpoint_000184, then from checkpoint_000002, then from checkpoint_000001, and then it should give up (or restart).
One issue is that if there’s a big gap (e.g. from checkpoint_000184 back to checkpoint_000002), then such a large back-off might be undesirable. However, if there’s no valid checkpoint, I would rerun that trial anyway, so maybe it’s not an issue.
I’m not sure what the right logic to handle that would be. The two scenarios are:
there’s a checkpoint that’s only a few iterations behind the latest → great
the most recent valid checkpoint (before the latest) is very old → since the last checkpoint is corrupted, I would need to go back no matter what, because I can’t use the last one. So I think there’s no “too old” in this sense, or at least anything will do at that point for me (but other people might have a different preference).
However, this might cause issues with the logs, e.g. TensorBoard or the CSV logger will have a sudden break in the iterations.
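To make the idea concrete, here is a rough sketch of the back-off logic I have in mind (this is not Tune’s actual API; restore_fn is a hypothetical placeholder for whatever actually loads a checkpoint):

```python
import os
import re


def restore_with_backoff(trial_dir, restore_fn):
    """Try the newest checkpoint first and back off to older ones on failure.

    `restore_fn(checkpoint_dir)` stands in for whatever actually loads a
    checkpoint; it should raise if the checkpoint is corrupted or missing.
    Returns the directory that was successfully restored, or None.
    """
    pattern = re.compile(r"checkpoint_(\d+)$")
    # Collect checkpoint_<NNNNNN> directories, newest first.
    candidates = sorted(
        (name for name in os.listdir(trial_dir) if pattern.match(name)),
        key=lambda name: int(pattern.match(name).group(1)),
        reverse=True,
    )
    for name in candidates:
        checkpoint_dir = os.path.join(trial_dir, name)
        try:
            restore_fn(checkpoint_dir)
            return checkpoint_dir  # e.g. checkpoint_000184 if _000200 is corrupted
        except Exception:
            continue  # corrupted or incomplete, back off to the previous one
    return None  # no usable checkpoint: give up or restart from scratch
```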
Another thing that may be worthwhile to look into is why the checkpoints do not exist. Is this something you run into often? How big are the checkpoints, and how often are things checkpointed? Also, are you uploading to cloud storage or syncing to the driver node?
I’m logging locally. The checkpoints are a few MB, sometimes a few hundred MB, depending on the model. I usually checkpoint every iteration or every 5-10 iterations, and keep only the 3 best.
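For reference, a minimal sketch of what that setup roughly looks like with the (legacy) tune.run / Trainable API; the trainable, metric and experiment name below are just placeholders:

```python
import os
from ray import tune


class MyTrainable(tune.Trainable):
    """Toy trainable standing in for the real model (hypothetical)."""

    def setup(self, config):
        self.step_count = 0

    def step(self):
        self.step_count += 1
        # Report a dummy metric; the real run reports actual training metrics.
        return {"mean_loss": 1.0 / self.step_count}

    def save_checkpoint(self, tmp_checkpoint_dir):
        with open(os.path.join(tmp_checkpoint_dir, "state.txt"), "w") as f:
            f.write(str(self.step_count))
        return tmp_checkpoint_dir

    def load_checkpoint(self, checkpoint_dir):
        with open(os.path.join(checkpoint_dir, "state.txt")) as f:
            self.step_count = int(f.read())


tune.run(
    MyTrainable,
    name="backoff_demo",                    # placeholder experiment name
    stop={"training_iteration": 100},
    checkpoint_freq=5,                      # checkpoint every 5 iterations
    checkpoint_at_end=True,
    keep_checkpoints_num=3,                 # keep only the 3 best checkpoints
    checkpoint_score_attr="min-mean_loss",  # "best" = lowest mean_loss
    # resume=True,  # uncomment to resume from a previously recorded experiment state
)
```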
Usually this issue happens when I CTRL+C the run and in some cases Tune gets stuck and doesn’t exit cleanly. It doesn’t happen often, and I can avoid killing the run.
One case where it almost always happens is when I don’t wait for the last checkpoint after a CTRL+C. I rarely do this myself, but when I’m using Slurm or some other cluster manager and my job gets killed, it’s up to the manager to leave enough time for Tune to exit gracefully, which doesn’t always happen. In those cases this back-off would be useful once the training job is resumed.
I can report we have a similar issue.
We also sometimes notice corrupted checkpoints. I think it is due to what @vakker00 describes above.
We use Azure low-priority VMs (discounted VMs based on datacenter surplus capacity). When there is insufficient surplus capacity, the VMs get killed and restarted when sufficient surplus capacity is available again. I have no control over how long Tune gets to gracefully exit (to be honest, I think Tune gets practically zero time because Azure simply stops the Docker container running Tune’s Python process). I think this is similar to hitting CTRL+C a couple of times in a row quickly.
On restart of the VM, Tune almost always fails to continue the run. It finds the checkpoint but cannot load it (it reports FileNotFound even though the file is really there, and it clearly knows it wants to resume from there, so it can find the checkpoint files). Or it ‘thinks’ a checkpoint should be present that in fact is not.
I would love functionality for Tune to then try the checkpoint before the last one!
We also keep the 5 best checkpoints. But if this functionality gets implemented, I would also want to keep the 5 most recent checkpoints in addition, to have a better chance of Tune recovering and continuing a run. If resuming is impossible from all checkpoints, a flag to say “then just restart from the beginning” would be great. In practice we start a run to get it finished, so the less delay in restarting the better.
Furthermore, keep up the good work! I like Ray (and its potential) very much!
Yes, there are some issues with state consistency when Tune runs are hard killed. I agree we should build more flexibility into resuming from older checkpoints (and try to keep experiment checkpoint state consistent).
Unfortunately there is no immediate workaround for that, other than perhaps copying/renaming checkpoint directories to match what Tune expects in the current experiment state.
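Roughly, that manual workaround would be something like the following sketch (the paths are hypothetical and mirror the example above; it just copies the newest intact checkpoint to the directory name the experiment state expects):

```python
import shutil
from pathlib import Path

# Hypothetical trial directory; adjust to your actual ~/ray_results layout.
trial_dir = Path("~/ray_results/my_experiment/my_trial").expanduser()

# The experiment state expects checkpoint_000200, but only checkpoint_000184
# is intact (as in the example above), so present it under the expected name.
src = trial_dir / "checkpoint_000184"
dst = trial_dir / "checkpoint_000200"

shutil.rmtree(dst, ignore_errors=True)  # drop the corrupted/empty directory
shutil.copytree(src, dst)               # copy the intact checkpoint in its place
```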
But we will definitely consider these comments in our planning!