How to resume training from a checkpoint

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi everyone.

Using ray.tune, I am able to train my model and save its state as training goes. However, I still haven’t found a way to reuse my trained policy in another setting (i.e., reusing the neural network in a different experiment).

What is the official way to do this?

@Finebouche I had the same questions a few months ago. I found it odd that Ray Tune doesn’t provide a straightforward way to start a new Tune job from an existing Tune checkpoint; you have to go through some gyrations. I eventually learned how to use a callback to restore just the policy weights from a checkpoint and proceed from there. Of course, that starts over with no optimizer params, training loop counters, etc., but the raw NN itself can be carried forward. For details, see Tune as part of curriculum training - #14 by gjoliver. It is a bit of a circuitous discussion for a while, but about halfway down you see the talk of the callbacks.
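A minimal sketch of that callback approach, assuming an RLlib Algorithm checkpoint on the old API stack; the checkpoint path and the callback name are placeholders, not something taken from this thread:

```python
from ray.rllib.algorithms.callbacks import DefaultCallbacks
from ray.rllib.policy.policy import Policy

# Placeholder path to the checkpoint of the earlier experiment.
PRETRAINED_CHECKPOINT = "/path/to/old_experiment/checkpoint_000100"

class RestoreWeightsCallback(DefaultCallbacks):
    def on_algorithm_init(self, *, algorithm, **kwargs):
        # Load the policy (or dict of policies) stored in the old checkpoint.
        restored = Policy.from_checkpoint(PRETRAINED_CHECKPOINT)
        if isinstance(restored, dict):
            restored = restored["default_policy"]
        # Copy only the network weights into the freshly built algorithm;
        # optimizer state and training-loop counters start from scratch.
        algorithm.get_policy().set_weights(restored.get_weights())
        # Push the weights out to the remote rollout workers as well.
        algorithm.workers.sync_weights()
```

The callback would then be registered on the new experiment's config (e.g. `config.callbacks(RestoreWeightsCallback)`) before handing it to Tune.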

@starkj Thanks for posting this here!

@Finebouche Have you tried this approach yet? Using Policy.from_checkpoint() is the official way here. Also take a look at the example here.
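For reference, a minimal sketch of the `Policy.from_checkpoint()` route (the checkpoint path is a placeholder):

```python
from ray.rllib.policy.policy import Policy

# Pointing this at an Algorithm checkpoint returns a dict mapping policy IDs
# to restored Policy objects; a single-policy checkpoint returns the Policy itself.
restored = Policy.from_checkpoint("/path/to/checkpoint_000100")
policy = restored["default_policy"] if isinstance(restored, dict) else restored

# The restored policy can be used for inference ...
# action, _, _ = policy.compute_single_action(obs)
# ... or its weights can be used to seed a new training run:
weights = policy.get_weights()
```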

@starkj, thank you very much for posting here!

Can you give me some advice on how to restore optimizer parameters to continue training from the checkpoint? And did you find out where the module_state.pt and default_policy_default_optimizer.pt files are read?

Thank you in advance!

Hi @Lars_Simon_Zehnder, I did try the example you linked, but it’s not working.

I will take a look at @starkj’s solution tomorrow. Thanks for the help, everyone!

@Alex-Golod I did not do anything with the optimizer data, as I just wanted the raw network params (e.g., to use them for pre-training).

@starkj Thank you for your reply. In Ray 2.9.0 I found this new functionality; it may help with restoring optimizer parameters: Learner (Alpha) — Ray 2.9.0
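If the goal is to continue training with optimizer state and counters intact (rather than only reusing the network weights), restoring the whole algorithm from its checkpoint is another option; a hedged sketch with a placeholder path:

```python
from ray.rllib.algorithms.algorithm import Algorithm

# Rebuilds the algorithm, including learner/optimizer state, from the checkpoint.
algo = Algorithm.from_checkpoint("/path/to/checkpoint_000100")
result = algo.train()  # continues training from the saved state
```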