ValueError when restoring checkpoint with PPO

starkj · October 20, 2022, 2:57am

How severely does this issue affect your experience of using Ray?

High: It blocks me from completing my task.

I used Ray 2.0.0 to train a simple FCNet with 2 hidden layers (256, 256) and stored the result in a checkpoint. Later, I load the checkpoint using:

from ray.rllib.algorithms import ppo  # module path in Ray 2.0.0

algo = ppo.PPO(config=config, env=env)
algo.restore(path_to_checkpoint)

This works well for inference, although the network’s performance is so-so. However, if I train a model with 3 hidden layers, e.g. [300, 128, 64], training works well, but restoring the checkpoint for inference fails with the following error message: ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group.

I feel like there’s a config item that I should be setting for the optimizer, or something similar, but I can’t find any guidance in the Ray docs. I would be grateful to anyone who can provide some guidance.

starkj · October 20, 2022, 4:04am

Lightbulb moment: the checkpoint correctly stores the 3-layer network structure and the associated optimizer state. What I was missing is that, when creating the algo, I only used the default PPO config parameters, which specify a 2-layer network. So the restore() method was trying to load a 3-layer checkpoint into a 2-layer structure. Once I set config["model"]["fcnet_hiddens"] = [300, 128, 64] before creating the algo, the restore works just fine.
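
For anyone hitting the same error, here is a minimal sketch of the fix described above. It assumes Ray 2.0.0 with a dict-style config. The helper names with_trained_architecture and restore_ppo are hypothetical, introduced only for illustration, and env and path_to_checkpoint stand in for whatever was used during training.

```python
import copy

# Hidden-layer sizes used when the checkpoint was trained (from the post above).
TRAINED_HIDDENS = [300, 128, 64]


def with_trained_architecture(base_config, hiddens=TRAINED_HIDDENS):
    """Return a copy of base_config whose model matches the trained network.

    Hypothetical helper: it only sets config["model"]["fcnet_hiddens"],
    leaving every other key of the config untouched.
    """
    config = copy.deepcopy(base_config)
    config.setdefault("model", {})["fcnet_hiddens"] = list(hiddens)
    return config


def restore_ppo(base_config, env, path_to_checkpoint):
    """Build PPO with the matching architecture, then restore the checkpoint.

    Hypothetical helper; assumes the Ray 2.0.0 module layout.
    """
    # Imported lazily so the config helper above works without Ray installed.
    from ray.rllib.algorithms import ppo

    config = with_trained_architecture(base_config)
    algo = ppo.PPO(config=config, env=env)
    algo.restore(path_to_checkpoint)
    return algo
```

The key point is the ordering: the architecture has to be in the config before the algorithm is constructed, because restore() loads weights and optimizer state into whatever network the constructor already built.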
