I am using RLlib to design an adaptive controller, and I have what may be a dumb question. As far as I understand, PPO implements either a clipped surrogate objective or an adaptive KL-penalty coefficient. However, when I configure my RLlib agent, I have to provide hyperparameters for both methods. I was told that RLlib agents trade off between these two approaches. Is that true? If yes, why, and where can I find information/documentation about this?
The original PPO paper does not make the choice between the clipped surrogate objective and the adaptive KL penalty exclusive, and RLlib indeed uses both. I am not aware of an article that describes this explicitly, but you can see it for yourself in the code!
You will find the clipped surrogate objective in lines 88ff. and the complete expression of the loss in lines 111ff.
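To make that concrete, here is a minimal sketch (not RLlib's actual code) of how the two terms combine into a single loss, in the spirit of the PPO paper. It leaves out the value-function and entropy terms that RLlib's full loss also includes, and the function name and signature are my own:

```python
import torch

def combined_ppo_loss(logp, logp_old, advantages, kl, kl_coeff, clip_param=0.3):
    """Illustrative only: clipped surrogate objective plus a KL penalty."""
    # Probability ratio pi_new(a|s) / pi_old(a|s).
    ratio = torch.exp(logp - logp_old)
    # Clipped surrogate objective (the "clip" variant of PPO).
    surrogate = torch.min(
        advantages * ratio,
        advantages * torch.clamp(ratio, 1.0 - clip_param, 1.0 + clip_param),
    )
    # Combined loss: maximize the surrogate while penalizing the KL divergence
    # from the old policy, weighted by the (adaptively updated) kl_coeff.
    return -torch.mean(surrogate) + kl_coeff * torch.mean(kl)
```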
If you want PPO to leave out the KL term, have a look at the PPO execution plan. You can copy it and simply leave out the KL_Update call in line 294!
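If you would rather not touch the execution plan at all, a lighter-weight option (worth double-checking against your RLlib version) is to set the initial KL coefficient to zero in the config; since the adaptive update only scales the coefficient up or down, it then stays at zero and the KL term never contributes. Roughly:

```python
from ray.rllib.agents.ppo import PPOTrainer  # newer Ray versions: ray.rllib.algorithms.ppo

config = {
    "env": "CartPole-v1",  # any registered env; just an example
    "clip_param": 0.2,     # clipped surrogate objective stays active
    "kl_coeff": 0.0,       # initial KL coefficient; the adaptive update only
                           # scales it, so it remains 0 -> no KL penalty
}

trainer = PPOTrainer(config=config)
result = trainer.train()
```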