The resume option for tune.run tries to resume from the last checkpoint that’s recorded in the experiment state (if I understand it correctly).
However, in many cases the last checkpoint is corrupted or does not exist, so tune just gives up and marks the trial as failed.
Would it be possible to back off to an earlier existing checkpoint and try that instead? E.g. if I have:
checkpoint_000001 checkpoint_000002 checkpoint_000184 checkpoint_000200
and checkpoint_000200 is corrupted, then tune should try to reload from checkpoint_000184, then from checkpoint_000002, then from checkpoint_000001, and only then give up (or restart the trial).
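To make the proposal concrete, here is a rough sketch of the back-off in plain Python. This is not Tune's actual code: restore_with_backoff, list_checkpoints_newest_first and the try_load callback are hypothetical names I made up for illustration. I am also assuming that checkpoints are directories named checkpoint_<number> inside the trial directory, and that loading a corrupted checkpoint raises an exception.

```python
import re
from pathlib import Path
from typing import Callable, List, Optional, Tuple, TypeVar

T = TypeVar("T")

# Assumed naming scheme: checkpoint_000184, checkpoint_000200, ...
_CHECKPOINT_RE = re.compile(r"^checkpoint_(\d+)$")


def list_checkpoints_newest_first(trial_dir: Path) -> List[Path]:
    """Return checkpoint directories sorted by iteration number, newest first."""
    found = []
    for entry in trial_dir.iterdir():
        match = _CHECKPOINT_RE.match(entry.name)
        if match and entry.is_dir():
            found.append((int(match.group(1)), entry))
    found.sort(key=lambda item: item[0], reverse=True)
    return [path for _, path in found]


def restore_with_backoff(
    trial_dir: Path, try_load: Callable[[Path], T]
) -> Optional[Tuple[Path, T]]:
    """Try each checkpoint from newest to oldest.

    try_load is a hypothetical user-supplied loader that raises if the
    checkpoint is missing files or corrupted. Returns the first checkpoint
    that loads together with its state, or None if none of them load.
    """
    for checkpoint in list_checkpoints_newest_first(trial_dir):
        try:
            return checkpoint, try_load(checkpoint)
        except Exception:
            # Corrupted or incomplete checkpoint: back off to the previous one.
            continue
    return None  # nothing usable: the caller gives up or restarts the trial
```

With the four checkpoints above and a broken checkpoint_000200, this would return checkpoint_000184; a None result corresponds to the "give up (or restart)" case.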
One issue is that if there’s a big gap (e.g. resuming from checkpoint_000002), then backing off that far might be undesired. However, if there’s no valid checkpoint at all, I would rerun that trial from scratch anyway, so maybe it’s not an issue.