I’ve noticed that the KL coefficient update rule for PPO in RLlib is different from the one given in the original paper. Does anyone know why? Is it a bug, or is there something I’m missing here?
def update_kl(self, sampled_kl):
    # Update the current KL value based on the recently measured value.
    if sampled_kl > 2.0 * self.kl_target:
        self.kl_coeff *= 1.5
    elif sampled_kl < 0.5 * self.kl_target:
        self.kl_coeff *= 0.5
    # Return the current KL value.
    return self.kl_coeff
And the original paper gives the following update rule for the adaptive KL surrogate objective:
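Compute $d = \hat{\mathbb{E}}_t\big[\mathrm{KL}\big[\pi_{\theta_\mathrm{old}}(\cdot \mid s_t),\; \pi_\theta(\cdot \mid s_t)\big]\big]$, then:

- if $d < d_\mathrm{targ} / 1.5$: $\beta \leftarrow \beta / 2$
- if $d > d_\mathrm{targ} \times 1.5$: $\beta \leftarrow \beta \times 2$

(Sec. 4, “Adaptive KL Penalty Coefficient”, Schulman et al. 2017.)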
As far as I can see, RLlib’s implementation indeed differs from the one in the original paper.
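Written in the same form as RLlib’s update_kl, the paper’s rule would look roughly like this (just a comparison sketch, not actual RLlib code; kl_coeff plays the role of β):

```python
def update_kl_paper_style(self, sampled_kl):
    # Sketch of the paper's adaptive-KL rule: a 1.5x band around the
    # target, and the coefficient is doubled/halved outside of it.
    if sampled_kl > 1.5 * self.kl_target:
        self.kl_coeff *= 2.0
    elif sampled_kl < self.kl_target / 1.5:
        self.kl_coeff *= 0.5
    return self.kl_coeff
```

So RLlib uses a wider dead zone (0.5x–2x of the target instead of roughly 0.67x–1.5x) and a gentler 1.5x increase instead of doubling; the decrease step (0.5x) is the same.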
The original paper also says: “The parameters 1.5 and 2 above are chosen heuristically, but the algorithm is not very sensitive to them.” So this probably goes unnoticed in practice. @sven1977 should I open an issue and assign myself to it?
Not sure why there’s this difference. Would have to ask @ericl whether he remembers why we did it this way in our PPO implementation (it’s already like this in ray 0.8.0).
I don’t think we should change this right now. It may break people’s baselines and tuned configs.
The difference is also only marginal imho; both get the job done, automatically adjusting kl_coeff so the sampled KL stays close to its target value.
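As a rough sanity check, here is a toy simulation of both rules. It assumes, purely for illustration, that the measured KL shrinks as the penalty coefficient grows (sampled_kl = base_kl / (1 + kl_coeff)), which is of course a gross simplification of real training:

```python
def rllib_update(kl_coeff, sampled_kl, kl_target):
    # RLlib's rule: 0.5x-2x dead zone, gentle 1.5x increase.
    if sampled_kl > 2.0 * kl_target:
        return kl_coeff * 1.5
    elif sampled_kl < 0.5 * kl_target:
        return kl_coeff * 0.5
    return kl_coeff

def paper_update(kl_coeff, sampled_kl, kl_target):
    # Paper's rule: ~0.67x-1.5x dead zone, doubling/halving outside it.
    if sampled_kl > 1.5 * kl_target:
        return kl_coeff * 2.0
    elif sampled_kl < kl_target / 1.5:
        return kl_coeff * 0.5
    return kl_coeff

kl_target, base_kl = 0.01, 0.05
for name, update in [("rllib", rllib_update), ("paper", paper_update)]:
    kl_coeff = 0.2  # RLlib's default initial kl_coeff
    for _ in range(30):
        sampled_kl = base_kl / (1.0 + kl_coeff)  # toy stand-in for training
        kl_coeff = update(kl_coeff, sampled_kl, kl_target)
    print(f"{name}: kl_coeff={kl_coeff:.3f}, "
          f"final sampled_kl={base_kl / (1.0 + kl_coeff):.4f}")
```

Under this toy model both rules settle at a coefficient whose resulting KL sits inside their respective dead zones around kl_target; they just take slightly different steps to get there.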
It was probably implemented based on an earlier iteration of the PPO paper. I don’t think this hyperparameter is particularly sensitive though; it would be safer to leave it alone to avoid any backward-incompatible changes.