PPO forgetting some good actions


I am just trying to understand why this happens. I am training PPO in a custom deterministic environment. Each iteration (training rollout) typically contains 8 episodes, for which I plot the min/max/mean reward. I don't understand why PPO, once it discovers some good actions leading to the highest max reward, soon afterwards forgets them and goes back to actions leading to a smaller reward (see the chart below as an example). Any help in understanding this behavior, and which hyperparameters I should tune to address it? (I am currently using pretty much the default values for all hyperparameters.)


This is nothing unusual; neural networks can forget for many reasons. Generally, you simply want to checkpoint often enough to capture the policies that perform well. You can also take measures to improve training stability, such as using a sufficiently small learning rate, not evaluating asynchronously, clipping gradients, etc.
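As a rough illustration of the checkpointing advice, here is a minimal framework-agnostic sketch of a training loop that snapshots the best-performing policy seen so far, so that later "forgetting" cannot lose it. The `train_step` callable and its return values are hypothetical stand-ins, not part of any particular library; in practice you would swap in your framework's own train/checkpoint calls (e.g. an RLlib `Algorithm.train()` plus `save_checkpoint`, or a Stable-Baselines3 `EvalCallback` with `best_model_save_path`).

```python
import copy
import random


def train_with_best_checkpoint(train_step, num_iterations=50):
    """Run a training loop, keeping a snapshot of the best policy so far.

    `train_step` is a hypothetical callable taking the iteration index and
    returning (policy_state, mean_reward) for that iteration.
    """
    best_reward = float("-inf")
    best_policy = None
    for it in range(num_iterations):
        policy_state, mean_reward = train_step(it)
        if mean_reward > best_reward:
            # Copy the policy whenever it improves, so a later drop in
            # performance cannot overwrite the best one found.
            best_reward = mean_reward
            best_policy = copy.deepcopy(policy_state)
    return best_policy, best_reward


def noisy_train_step(it):
    """Toy stand-in for a noisy, non-monotonic PPO learning curve."""
    policy_state = {"iteration": it}
    mean_reward = it + random.uniform(-5, 5)  # improves on average, fluctuates
    return policy_state, mean_reward


if __name__ == "__main__":
    random.seed(0)
    policy, reward = train_with_best_checkpoint(noisy_train_step)
    print(f"best mean reward seen: {reward:.2f} "
          f"(iteration {policy['iteration']})")
```

Even if the live policy later regresses (as in your chart), the returned snapshot preserves the peak behavior, which is usually what you want to deploy.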
