What is the proper way to restore a training experiment's state when it gets stopped for some reason? Running TorchTrainer, I thought that passing the checkpoint directory to resume_from_checkpoint would be enough, but the whole training starts over again from the beginning.
Hey @0piero, are you loading the checkpoint from the training function?
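Roughly, the pattern being suggested is to pick the checkpoint up inside the training loop and restore the model/optimizer state yourself. A minimal sketch, assuming a recent Ray release where `ray.train.get_checkpoint()` and `ray.train.report()` are available (the model, the file name `model.pt`, and the checkpoint path are placeholders):

```python
import os
import tempfile

import torch
from ray import train
from ray.train import Checkpoint, ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    model = torch.nn.Linear(4, 1)  # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    start_epoch = 0

    # The checkpoint passed via resume_from_checkpoint shows up here.
    checkpoint = train.get_checkpoint()
    if checkpoint:
        with checkpoint.as_directory() as ckpt_dir:
            state = torch.load(os.path.join(ckpt_dir, "model.pt"))
            model.load_state_dict(state["model"])
            optimizer.load_state_dict(state["optimizer"])
            start_epoch = state["epoch"] + 1

    for epoch in range(start_epoch, config["num_epochs"]):
        # ... actual training step goes here ...
        with tempfile.TemporaryDirectory() as tmpdir:
            torch.save(
                {"model": model.state_dict(),
                 "optimizer": optimizer.state_dict(),
                 "epoch": epoch},
                os.path.join(tmpdir, "model.pt"),
            )
            # Report a checkpoint each epoch so the run can be resumed later.
            train.report(
                {"epoch": epoch},
                checkpoint=Checkpoint.from_directory(tmpdir),
            )


trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"num_epochs": 10},
    scaling_config=ScalingConfig(num_workers=2),
    # Checkpoint from the interrupted run; the path is a placeholder.
    resume_from_checkpoint=Checkpoint.from_directory("/path/to/checkpoint"),
)
result = trainer.fit()
```

As far as I understand, resume_from_checkpoint only makes the checkpoint available to the workers; actually reloading the weights and the epoch counter is up to the training function.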
I actually wasn't doing what you pointed out here. Instead, I was just passing the checkpoint to the resume_from_checkpoint argument and expecting it to automatically resume from the state where the run last stopped, the same way tune.run's resume behavior works. Is there a way to get this with the Trainer API?
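The tune.run resume behavior I'm referring to is roughly the following; a sketch assuming the older Tune function API, where `tune.run(..., resume=True)` reloads the experiment state from the experiment directory (newer releases expose `Tuner.restore` for the same purpose):

```python
from ray import tune


def my_trainable(config):
    # Hypothetical stand-in for the real training function.
    for step in range(100):
        tune.report(score=step)


# Re-running with the same name/local_dir and resume=True picks the
# experiment up from its last saved state instead of starting over.
tune.run(
    my_trainable,
    name="resume_demo",         # placeholder experiment name
    local_dir="~/ray_results",
    resume=True,
)
```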
Same question.
I am using RLTrainer in Tune.
The Trainer provides a resume_from_checkpoint argument, and internally it checks whether the checkpoint is available.
HOWEVER, after that check, RLTrainer does nothing with the checkpoint; it does not assign any of the checkpoint weights to the model.
How can I continue my training from already-trained checkpoints?
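For reference, one way to continue from an already-trained RLlib checkpoint, outside of resume_from_checkpoint, is to rebuild the algorithm from the checkpoint itself. This is only a rough sketch, assuming Ray 2.x RLlib where `Algorithm.from_checkpoint` is available; the checkpoint path is a placeholder:

```python
from ray.rllib.algorithms.algorithm import Algorithm

# Rebuild the algorithm (config, model weights, optimizer state) from an
# existing checkpoint directory; the path below is a placeholder.
algo = Algorithm.from_checkpoint("/path/to/rllib/checkpoint")

# Continue training from the restored state.
for _ in range(10):
    result = algo.train()
    print(result["training_iteration"])

# Save a new checkpoint that includes the continued progress.
new_checkpoint = algo.save()
```

When driving the run through Tune itself, the analogous mechanism is restoring the experiment (e.g. `Tuner.restore` in recent releases) rather than passing resume_from_checkpoint.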