How to pretrain a model with behavior cloning

I asked this question on Ray’s Slack channel and am transcribing the thread here so it’s available to everyone. Thanks to @sven1977 and @rliaw for answering my question.

Bruno Brandao:
Does anyone know how I can pre-train a model with Behaviour Cloning and then load it into PPO for further training? I understand that they are of different kinds (offline and online), but this will be addressed. The BC training will use an offline dataset, and the PPO will use a proper environment.

Richard Liaw (ray team):
cc @Sven Mika maybe we can create an example? seems also related to what @Michael Luo is working on

Sven Mika:
Sure, hey Bruno Brandao. There is a BC learning test inside rllib/agents/marwil/tests/test_bc.py, where you can see how to train a BC agent.
You could then store the BC agent’s weights (trainer.get_weights()) and re-load these (trainer.set_weights(…)) into a new PPO agent (using the same model!). Would that work?

Sven Mika:
I’ll create an example script.

Bruno Brandao:
Hi, thank you for responding so fast. Yes, I think get_weights() and set_weights() might work; I’ll try it out and come back with what I find. The example script would be amazing; I looked around and there are other people with the same question/issue.

The solution works and it is very simple. Here is an example of the code; it can be run step by step in a notebook to see the outputs and compare.

import ray
ray.init(ignore_reinit_error=True)

from ray.rllib.agents.marwil.bc import BCTrainer, BC_DEFAULT_CONFIG

# Configure the environment and the model; vf_share_layers=False keeps the
# value-function branch separate from the policy layers.
BC_DEFAULT_CONFIG['env'] = 'CartPole-v0'
BC_DEFAULT_CONFIG['model']['vf_share_layers'] = False
BC_DEFAULT_CONFIG['model']['fcnet_hiddens'] = [32, 16]

bcloning = BCTrainer(BC_DEFAULT_CONFIG)

# Inspect the full config to compare it against the PPO config later.
BC_DEFAULT_CONFIG

Run whatever BC training you want here, then get the weights.
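As a rough sketch, the training loop itself could look like the following; note that BC learns from offline data, so the config’s 'input' key needs to point to your recorded experiences before the trainer is created (the path and iteration count below are placeholders):

# Placeholder: set this *before* creating the BCTrainer above.
# BC_DEFAULT_CONFIG['input'] = '/path/to/offline-data'

# Run a few BC training iterations (the count is arbitrary).
for i in range(10):
    result = bcloning.train()
    print(i, result['timesteps_total'])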

# Extract the BC-trained weights (a dict keyed by policy id).
bcweights = bcloning.get_weights()
bcweights

from ray.rllib.agents.ppo import PPOTrainer, DEFAULT_CONFIG

# Use the exact same model settings as for the BC trainer so the weights
# are compatible.
DEFAULT_CONFIG['env'] = 'CartPole-v0'
DEFAULT_CONFIG['model']['vf_share_layers'] = False
DEFAULT_CONFIG['model']['fcnet_hiddens'] = [32, 16]

ppotrainer = PPOTrainer(DEFAULT_CONFIG)

Get the weights of the fresh PPO trainer just to see that they are different.

ppotrainer.get_weights()

Put the BC trained weights in the ppo trainer.

ppotrainer.set_weights(bcweights)

Check the PPO weights again; you’ll see that they now match, and the trainer can start the PPO training.

ppotrainer.get_weights()

The most important thing is to make sure the model configurations of both trainers match; otherwise the weights won’t match either.
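As an optional sanity check, you can compare the two weight dicts before copying. This is just a sketch, assuming the default single-policy setup where get_weights() returns a dict keyed by 'default_policy':

import numpy as np

# Both trainers should expose the same variable names with the same array
# shapes if their model configs match.
bc_w = bcloning.get_weights()['default_policy']
ppo_w = ppotrainer.get_weights()['default_policy']

assert set(bc_w.keys()) == set(ppo_w.keys())
for name in bc_w:
    assert np.shape(bc_w[name]) == np.shape(ppo_w[name]), name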


Thanks @BrunoBSM for copying and sharing this here! 🙂


Hello @BrunoBSM, thanks for sharing, it helped me a lot.

I was able to do this transfer learning between BC and PPO. I know that because I get a good reward when evaluating the PPO agent. However, when I start training the PPO agent it behaves like it is starting to learn from zero; it has bad episode rewards, as if it had started randomly without any pre-training.

Was this your experience also? Is there a way to continue from the pretrained weights?

Hi @Anas_BELFADIL, can you provide some more information as to how you’re starting the PPO training?

I ask because I’ve been having trouble using these weights inside Tune. Although today I found a solution by mathpluscode that might be the answer to this: it’s the last comment by him.

When I start the PPO training, after using set_weights, I just go through a training loop using the trainer I created.

I just start naively with ppo_agent.train(). Does this re-initialize the weights and discard the weights we set before?

No worries, ppo_agent.train() itself does not reset the weights.
Only creating the trainer (first step: trainer = ppo.PPOTrainer(config=...)) does, but if you load the weights (trainer.set_weights() or trainer.get_policy().set_weights()) after that and then start train()ing, it’ll be fine and the pre-trained weights will be used.
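For concreteness, here is a small sketch of the two variants mentioned above, assuming the default single-policy setup and the trainer names used earlier in this thread:

ppo_agent = PPOTrainer(DEFAULT_CONFIG)

# Variant 1: trainer-level; expects a dict keyed by policy id
# ({'default_policy': ...}), which is exactly what get_weights() returns.
ppo_agent.set_weights(bcloning.get_weights())

# Variant 2: policy-level; copies the raw weights of one policy directly.
ppo_agent.get_policy().set_weights(bcloning.get_policy().get_weights())

# Only now start training; train() itself will not re-initialize the weights.
ppo_agent.train()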


@sven1977, I think that without the past data, the restored weights will probably be forgotten after a few iterations (due to catastrophic forgetting).


@felipeeeantunes, yes, that could very well be the case: starting to learn with new data via e.g. PPO (after a previous BC run) could quickly degrade performance to below what the BC run achieved. But this has to be tried case by case.


In my experience, train()ing after trainer.set_weights() doesn’t continue training; it somehow loses the pre-training. But the solution in the last comment here, as pointed out by @BrunoBSM, solves the problem. Here is a Colab notebook with a recreation of the problem and the solution. I basically:

  • train a PPO agent on CartPole; it now gets a score of 200 in rewards,
  • transfer the weights to a new PPO agent and evaluate it; it works fine, getting a score of 200,
  • but then, when I start training, the performance is exactly as if I had started from scratch, at around 20,
  • when I instead do the transfer with the other method, training starts from around 100 (a possible cause is sketched below).
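For reference, one possible cause of this symptom (an assumption, not necessarily what the linked comment describes) is that trainer.set_weights() may only update the local worker, while the remote rollout workers that collect the training samples keep their freshly initialized weights. A sketch, using the trainer names from the example above, that explicitly pushes the weights to all workers:

# Copy the pretrained weights into the local worker ...
weights = bcloning.get_weights()
ppotrainer.set_weights(weights)

# ... and also push them to every rollout worker (local + remote), so that
# sample collection starts from the pretrained policy as well.
ppotrainer.workers.foreach_worker(lambda w: w.set_weights(weights))

ppotrainer.train()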

Hi, I had the same issue but fixed it by just saving and loading a new checkpoint:

# Get the pretrained weights from the BC trainer.
bcweights = bcloning.get_weights()

# Build a fresh PPO trainer with a matching model config.
agent = ppo.PPOTrainer(config)

# Copy the weights in, then save a checkpoint that contains them.
agent.set_weights(bcweights)
checkpoint_path = agent.save()
print(checkpoint_path)

# may want to add ray.shutdown and ray.init here

# Re-create the trainer and restore it from the checkpoint, so that all of
# its workers start from the pretrained weights.
agent = ppo.PPOTrainer(config)
agent.restore(checkpoint_path)

# now train as normal

Great Job. Thanks a lot!

Any chance we could get an update to this thread/example? Many functions in these examples have been deprecated (such as PPOTrainer()). I would love to implement BC or MARWIL pretraining to then transfer into a PPO RL algorithm but can’t seem to get it to work. Thanks!


Hi @Ryan_Spangler, could you be a little more specific about what exactly does not work for you? There is an example here about how to use a pretrained Policy to copy weights over. This is exactly how you would do it after you have trained your Policy with BC and want to start PPO with these parameters. But at this point we do not know what your specific problems are.


That’s the example I couldn’t find! Thank you!

I was using Tune to pretrain a model and I was struggling to restore the policy from a checkpoint and get the weights to copy over.
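In case it helps others landing here, a rough sketch of how this could look with the newer Algorithm/Policy API (where PPOTrainer has been replaced). This is based on my own assumptions (Ray 2.x, a single 'default_policy', a placeholder checkpoint path), not on the linked example, and attribute names may differ between Ray versions:

from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.policy.policy import Policy

# Restore only the policy (not the whole Algorithm) from the pretraining
# checkpoint; the path is a placeholder.
restored = Policy.from_checkpoint("/path/to/pretrain/checkpoint")
pretrained = restored["default_policy"] if isinstance(restored, dict) else restored

# Build a fresh PPO Algorithm (make sure its model config matches the
# pretrained one) ...
algo = PPOConfig().environment("CartPole-v1").build()

# ... copy the pretrained weights into its policy, and push them to the
# rollout workers as well.
algo.get_policy().set_weights(pretrained.get_weights())
algo.workers.sync_weights()

algo.train()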


Hi @Ryan_Spangler, did you manage to make it work using @Lars_Simon_Zehnder’s example? I am struggling to recover the weights.