Finetuning MBMPO policy

I am trying to finetune the MBMPO policy using PPO.

Restoring the MBMPO checkpoint directly into PPO (ppo_agent.restore(checkpoint_path)) does not work and fails with the error below, because the optimizer state differs:

ValueError: loaded state dict contains a parameter group that doesn’t match the size of optimizer’s group

To avoid this issue, I take the weights of the MBMPO policy and set them on the PPO agent using set_weights. However, that gives me the error below:

RuntimeError: Error(s) in loading state_dict for FullyConnectedNetwork:
Missing key(s) in state_dict: “_value_branch_separate.0._model.0.weight”, “_value_branch_separate.0._model.0.bias”, “_value_branch_separate.1._model.0.weight”, “_value_branch_separate.1._model.0.bias”.

The weights dict of the MBMPO policy has these keys: “_logits._model.0.weight”, “_logits._model.0.bias”, “_hidden_layers.0._model.0.weight”, “_hidden_layers.0._model.0.bias”, “_hidden_layers.1._model.0.weight”, “_hidden_layers.1._model.0.bias”.
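To make the mismatch explicit, the two key sets can be diffed before calling set_weights. The snippet below is only a minimal sketch in plain Python: `diff_weight_keys` is a hypothetical helper, and the PPO key list is an assumption built from the MBMPO keys above plus the four keys named in the error message, not something read from a real PPO model.

```python
def diff_weight_keys(source_keys, target_keys):
    """Compare two collections of parameter names.

    Returns (missing, unexpected):
      missing    - names the target model expects but the source lacks
      unexpected - names the source has but the target does not know
    """
    source, target = set(source_keys), set(target_keys)
    return sorted(target - source), sorted(source - target)


# Keys of the MBMPO policy weights, as listed in this post.
MBMPO_KEYS = [
    "_logits._model.0.weight",
    "_logits._model.0.bias",
    "_hidden_layers.0._model.0.weight",
    "_hidden_layers.0._model.0.bias",
    "_hidden_layers.1._model.0.weight",
    "_hidden_layers.1._model.0.bias",
]

# Keys reported as missing in the RuntimeError above.
VALUE_BRANCH_KEYS = [
    "_value_branch_separate.0._model.0.weight",
    "_value_branch_separate.0._model.0.bias",
    "_value_branch_separate.1._model.0.weight",
    "_value_branch_separate.1._model.0.bias",
]

# ASSUMPTION: the PPO model expects the MBMPO keys plus the value-branch keys.
PPO_KEYS = MBMPO_KEYS + VALUE_BRANCH_KEYS

missing, unexpected = diff_weight_keys(MBMPO_KEYS, PPO_KEYS)
print("missing in MBMPO weights:", missing)
print("unexpected in MBMPO weights:", unexpected)
```

Under that assumption, the only difference is the separate value branch, which the MBMPO weights do not contain.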

Question: How can I fine-tune an MBMPO policy?

@sven1977 @michaelzhiluo1

Hi @Nehal_Soni ,

If you want to use a PPO policy in MBMPO, you will have to make sure that the weights dict exactly matches the networks you are trying to restore. This will not be possible without some digging and handcrafting and is not supported by our public APIs.

  1. Instantiate an MBMPO Algorithm on your environment, with the rest of the settings mirroring your PPO settings where possible
  2. Extract the policy from the Algorithm via algorithm.get_policy(DEFAULT_POLICY_ID)
  3. Have a good look at the models contained in the policy and print them
  4. Do the same for PPO and compare the models - you will have to make them look the same in order for anything going forward to make sense
  5. Write your own policy classes, inheriting from the MBMPO and PPO policies of the framework of your choice
  6. Modify their get_weights and set_weights methods such that the names of all variables match upon restoration
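The name-matching part of the last step could look roughly like the sketch below. It is plain Python on name-to-array dicts and not an RLlib API: `merge_weights` and its `rename` argument are hypothetical, and leaving unmatched parameters (such as a value branch) at their fresh initial values is an assumption about what is acceptable for fine-tuning.

```python
def merge_weights(source, target, rename=None):
    """Build a weights dict that has exactly the target's keys.

    source: weights of the trained policy (name -> array-like)
    target: weights of the freshly built policy (name -> array-like)
    rename: optional {target_name: source_name} for differing names

    Returns (merged, kept), where `kept` lists the target names that
    could not be filled from `source` and stay at their initial values.
    """
    rename = rename or {}
    merged, kept = {}, []
    for name, init_value in target.items():
        value = source.get(rename.get(name, name))
        # Values without a .shape attribute (e.g. plain lists) compare as
        # None == None, so the shape check only bites for real arrays.
        same_shape = getattr(value, "shape", None) == getattr(init_value, "shape", None)
        if value is not None and same_shape:
            merged[name] = value
        else:
            merged[name] = init_value
            kept.append(name)
    return merged, kept


# Toy stand-ins for the two weights dicts (plain lists instead of arrays).
trained = {
    "_hidden_layers.0._model.0.weight": [1.0, 2.0],
    "_logits._model.0.weight": [3.0],
}
fresh = {
    "_hidden_layers.0._model.0.weight": [0.0, 0.0],
    "_logits._model.0.weight": [0.0],
    "_value_branch_separate.0._model.0.weight": [0.0],
}

merged, kept = merge_weights(trained, fresh)
print("left at initial values:", kept)
```

In practice the two dicts would presumably come from get_weights on each policy, with the merged result passed to set_weights on the target policy. The models still have to be structurally identical for the shared layers, as described in step 4.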


Thank you @arturn for your quick response, it’s helpful.

I understand that a MAML policy needs to be fine-tuned, and that this is possible directly using RLlib's PPO algorithm (this thread mentions it and it has also been tested: MAML finetune adaptation step for inference).

Is there any better approach in RLlib to fine-tune MBMPO policy?


Hi @Nehal_Soni ,

Not that I know of. Our current implementations of policies and models make this process very cumbersome. But @kourosh is redesigning the policy and model APIs and, together with connectors, you will probably see many changes that will better support your use case.

The thread you mention tells us that it’s possible with MAML, but not that the steps are different. You still have to create a perfect match between model parameters and then manually reconstruct the model - no way around that atm.