Hi,
I'm trying to understand why this happens. I am training PPO on a custom deterministic environment. Each iteration (training rollout) typically contains 8 episodes, for which I plot the min/max/mean reward. I don't understand why PPO, once it discovers some good actions leading to the highest max reward, soon forgets them and reverts to actions that yield a smaller reward (see the chart below for an example). Any help in understanding this behavior, and in knowing which hyperparameters I should tune to address it, would be appreciated. (I am currently using pretty much the default values for all hyperparameters.)
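For reference, my setup is essentially the following (a minimal sketch, not my actual code: I'm assuming Stable-Baselines3 here just to show the defaults, and "CartPole-v1" stands in for my custom environment):

```python
import gymnasium as gym
from stable_baselines3 import PPO

# Placeholder for my custom deterministic environment.
env = gym.make("CartPole-v1")

model = PPO(
    "MlpPolicy",
    env,
    learning_rate=3e-4,   # SB3 default
    n_steps=2048,         # rollout length per iteration (SB3 default)
    ent_coef=0.0,         # SB3 default; the entropy bonus affects exploration
    clip_range=0.2,       # SB3 default PPO policy-clipping parameter
    verbose=1,
)
model.learn(total_timesteps=100_000)
```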
Thanks,
Antonio