What is the proper way to restore a training experiment's state when it gets stopped for some reason? Running TorchTrainer, I thought that providing the checkpoint directory to resume_from_checkpoint would be enough, but the whole training starts over again from the beginning.
I actually wasn't doing what you pointed out here. Instead, I was just passing the checkpoint to the resume_from_checkpoint argument and expecting it to automatically resume from the state where the run stopped last time, the same way resume works in tune.run. Is there a way to get this behavior with the Trainer API?
I am using RLTrainer in Tune.
The Trainer provides a resume_from_checkpoint argument, and internally it checks whether the checkpoint is available.
HOWEVER, after checking, the RLTrainer does nothing with the checkpoint; it does not load the checkpoint weights into the model.
How can I continue my training from already trained checkpoints?
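For what it's worth, the general pattern both posts are after is usually "restore state explicitly inside the training loop, then skip the work already done" rather than relying on the framework to do it implicitly. Below is a minimal, framework-free sketch of that pattern using only the standard library. All names here (`train_one_epoch`, `CHECKPOINT_PATH`, the shape of the state dict) are hypothetical illustrations, not Ray APIs; in Ray you would do the equivalent load inside your training function.

```python
# Minimal sketch of manual checkpoint resumption (hypothetical names,
# not Ray APIs). The key idea: passing a checkpoint to a trainer only
# makes it *available*; your loop must explicitly restore the epoch
# counter and weights and continue from there.
import os
import pickle

CHECKPOINT_PATH = "checkpoint.pkl"
TOTAL_EPOCHS = 10

def train_one_epoch(weights):
    # Stand-in for a real optimization step: nudge each weight.
    return [w + 0.1 for w in weights]

def load_or_init_state():
    # Restore epoch counter and weights if a checkpoint exists,
    # otherwise start fresh. This explicit load is the step that
    # a resume_from_checkpoint-style argument does not perform for you.
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH, "rb") as f:
            return pickle.load(f)
    return {"epoch": 0, "weights": [0.0, 0.0]}

def save_state(state):
    with open(CHECKPOINT_PATH, "wb") as f:
        pickle.dump(state, f)

def run():
    state = load_or_init_state()
    # Resume from the saved epoch instead of epoch 0.
    for epoch in range(state["epoch"], TOTAL_EPOCHS):
        state["weights"] = train_one_epoch(state["weights"])
        state["epoch"] = epoch + 1
        save_state(state)
    return state

if __name__ == "__main__":
    first = run()    # trains epochs 0-9 and checkpoints each one
    resumed = run()  # finds the checkpoint; nothing left to train
    print(resumed["epoch"])  # -> 10
```

A second call to `run()` picks up the saved state and performs no extra epochs, which is the resume behavior the thread is asking for. With Ray's Trainer API, the same restore-then-skip logic would live inside your training loop function.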