Differences between the PPO implementation and the original PPO paper

I’ve noticed that the KL coefficient update rule for PPO in RLlib differs from the one given in the original paper. Does anyone know why? Is it a bug, or is there something I’m missing here?

    def update_kl(self, sampled_kl):
        # Update the KL coefficient based on the recently measured KL divergence.
        if sampled_kl > 2.0 * self.kl_target:
            self.kl_coeff *= 1.5
        elif sampled_kl < 0.5 * self.kl_target:
            self.kl_coeff *= 0.5
        # Return the (possibly updated) KL coefficient.
        return self.kl_coeff

And the original paper gives the following update rule for the adaptive KL penalty variant of the surrogate objective:
    Compute d = Ê_t[ KL[π_θ_old(· | s_t), π_θ(· | s_t)] ]
    If d < d_targ / 1.5:   β ← β / 2
    If d > d_targ × 1.5:   β ← β × 2
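For context, this β is the kl_coeff above, and the rule sits inside the paper’s KL-penalized surrogate objective (Schulman et al., 2017, “Adaptive KL Penalty Coefficient” section; reproduced from memory, so double-check against the paper):

    L^{KLPEN}(\theta) = \hat{\mathbb{E}}_t\left[ r_t(\theta)\,\hat{A}_t - \beta\,\mathrm{KL}\big[\pi_{\theta_{\text{old}}}(\cdot \mid s_t),\ \pi_\theta(\cdot \mid s_t)\big] \right],
    \quad\text{where}\quad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}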

I would be very interested in an answer to this as well.

As far as I can see, RLlib’s implementation indeed differs from the one in the original paper.
The original paper also says: “The parameters 1.5 and 2 above are chosen heuristically, but the algorithm is not very sensitive to them.” So this probably goes unnoticed in practice.
@sven1977 should I open an issue and assign myself to it?


Yeah, I think opening an issue is a safe move here, @arturn.

Not sure why there’s this difference. Would have to ask @ericl whether he remembers why we did this in our PPO implementation (it’s already there in ray==0.8.0).
I don’t think we should change this right now. It may break people’s baselines and tuned configs.
The difference is also only marginal imho (both rules get the job done, automatically adapting kl_coeff so that the sampled KL stays close to its target value):

        if sampled_kl > 2.0 * self.kl_target:
            self.kl_coeff *= 1.5
        elif sampled_kl < 0.5 * self.kl_target:
            self.kl_coeff *= 0.5

vs

        if sampled_kl > 1.5 * self.kl_target:
            self.kl_coeff *= 2.0
        elif sampled_kl < self.kl_target / 1.5:  # paper's threshold, i.e. ~0.6667 * kl_target
            self.kl_coeff *= 0.5
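For the record, here is a quick toy comparison of the two rules. This is not RLlib code: the feedback model, in which a larger kl_coeff shrinks the KL measured after the next update, is made up purely for illustration, and the constants (initial coefficient, noise range) are arbitrary. It just shows that both rules behave like simple bang-bang controllers that keep the measured KL near kl_target:

    import random

    def rllib_rule(kl_coeff, sampled_kl, kl_target):
        # RLlib: wider trigger band (0.5x .. 2x target), gentler 1.5x increase.
        if sampled_kl > 2.0 * kl_target:
            kl_coeff *= 1.5
        elif sampled_kl < 0.5 * kl_target:
            kl_coeff *= 0.5
        return kl_coeff

    def paper_rule(kl_coeff, sampled_kl, kl_target):
        # Paper: tighter band (target/1.5 .. 1.5x target), stronger 2x increase.
        if sampled_kl > 1.5 * kl_target:
            kl_coeff *= 2.0
        elif sampled_kl < kl_target / 1.5:
            kl_coeff *= 0.5
        return kl_coeff

    def simulate(rule, kl_target=0.01, steps=100, seed=0):
        # Made-up response model: a larger penalty coefficient shrinks the
        # KL measured after the next policy update. Purely illustrative.
        rng = random.Random(seed)
        kl_coeff = 0.2
        kls = []
        for _ in range(steps):
            sampled_kl = kl_target * rng.uniform(1.0, 8.0) / (1.0 + 20.0 * kl_coeff)
            kls.append(sampled_kl)
            kl_coeff = rule(kl_coeff, sampled_kl, kl_target)
        # Report the mean KL over the second half, once things have settled.
        return sum(kls[steps // 2:]) / (steps - steps // 2)

    for name, rule in [("rllib", rllib_rule), ("paper", paper_rule)]:
        print(f"{name}: mean settled KL = {simulate(rule):.5f} (target 0.01)")

Under this toy model both rules settle the measured KL near the target; they differ only in where the trigger thresholds sit and how aggressively the coefficient is bumped up.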

What does @ericl say? In the meantime, the issue is open on GitHub.

It was probably implemented based on an earlier iteration of the PPO paper. I don’t think this hyperparameter is particularly sensitive, though; it would be safer to leave it alone to avoid any backward-incompatible changes.
