Differences between the PPO implementation and the original PPO paper

I’ve noticed that the KL coefficient update rule for PPO in RLlib differs from the one given in the original paper. Does anyone know why? Is it a bug, or is there something I’m missing here?

    def update_kl(self, sampled_kl):
        # Update the KL coefficient based on the recently measured KL divergence.
        if sampled_kl > 2.0 * self.kl_target:
            self.kl_coeff *= 1.5
        elif sampled_kl < 0.5 * self.kl_target:
            self.kl_coeff *= 0.5
        # Return the (possibly updated) KL coefficient.
        return self.kl_coeff

And the original paper gives the following update rule for the adaptive KL penalty variant of the surrogate objective:
    Compute d = Ê_t[ KL[π_θ_old(· | s_t), π_θ(· | s_t)] ]
    If d < d_targ / 1.5:   β ← β / 2
    If d > d_targ × 1.5:   β ← β × 2
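For context, this β is the kl_coeff above, and the rule sits inside the paper’s KL-penalized surrogate objective (Schulman et al., 2017, “Adaptive KL Penalty Coefficient” section; reproduced from memory, so double-check against the paper):

    L^{KLPEN}(\theta) = \hat{\mathbb{E}}_t\left[ r_t(\theta)\,\hat{A}_t - \beta\,\mathrm{KL}\big[\pi_{\theta_{\text{old}}}(\cdot \mid s_t),\ \pi_\theta(\cdot \mid s_t)\big] \right],
    \quad\text{where}\quad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}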

I would be very interested in an answer to this as well.

As far as I can see, RLlib’s implementation indeed differs from the one in the original paper.
The original paper also says: “The parameters 1.5 and 2 above are chosen heuristically, but the algorithm is not very sensitive to them.” So this probably goes unnoticed in practice.
@sven1977 should I open an issue and assign myself to it?


Yeah, I think opening an issue is a safe move here, @arturn.

Not sure why there’s this difference. Would have to ask @ericl whether he remembers why we did this in our PPO implementation (it’s already there in ray==0.8.0).
I don’t think we should change this right now. It may break people’s baselines and tuned configs.
The difference is also only marginal imho (both rules get the job done, automatically adapting kl_coeff so that the sampled KL stays close to its target value):

        if sampled_kl > 2.0 * self.kl_target:
            self.kl_coeff *= 1.5
        elif sampled_kl < 0.5 * self.kl_target:
            self.kl_coeff *= 0.5

vs

        if sampled_kl > 1.5 * self.kl_target:
            self.kl_coeff *= 2.0
        elif sampled_kl < self.kl_target / 1.5:  # paper's threshold, i.e. ~0.6667 * kl_target
            self.kl_coeff *= 0.5
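For the record, here is a quick toy comparison of the two rules. This is not RLlib code: the feedback model, in which a larger kl_coeff shrinks the KL measured after the next update, is made up purely for illustration, and the constants (initial coefficient, noise range) are arbitrary. It just shows that both rules behave like simple bang-bang controllers that keep the measured KL near kl_target:

    import random

    def rllib_rule(kl_coeff, sampled_kl, kl_target):
        # RLlib: wider trigger band (0.5x .. 2x target), gentler 1.5x increase.
        if sampled_kl > 2.0 * kl_target:
            kl_coeff *= 1.5
        elif sampled_kl < 0.5 * kl_target:
            kl_coeff *= 0.5
        return kl_coeff

    def paper_rule(kl_coeff, sampled_kl, kl_target):
        # Paper: tighter band (target/1.5 .. 1.5x target), stronger 2x increase.
        if sampled_kl > 1.5 * kl_target:
            kl_coeff *= 2.0
        elif sampled_kl < kl_target / 1.5:
            kl_coeff *= 0.5
        return kl_coeff

    def simulate(rule, kl_target=0.01, steps=100, seed=0):
        # Made-up response model: a larger penalty coefficient shrinks the
        # KL measured after the next policy update. Purely illustrative.
        rng = random.Random(seed)
        kl_coeff = 0.2
        kls = []
        for _ in range(steps):
            sampled_kl = kl_target * rng.uniform(1.0, 8.0) / (1.0 + 20.0 * kl_coeff)
            kls.append(sampled_kl)
            kl_coeff = rule(kl_coeff, sampled_kl, kl_target)
        # Report the mean KL over the second half, once things have settled.
        return sum(kls[steps // 2:]) / (steps - steps // 2)

    for name, rule in [("rllib", rllib_rule), ("paper", paper_rule)]:
        print(f"{name}: mean settled KL = {simulate(rule):.5f} (target 0.01)")

Under this toy model both rules settle the measured KL near the target; they differ only in where the trigger thresholds sit and how aggressively the coefficient is bumped up.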

What does @ericl say? In the meantime, the issue is open on GitHub.

It was probably implemented based on an earlier iteration of the PPO paper. I don’t think this hyperparameter is particularly sensitive, though; it would be safer to leave it alone to avoid any backward-incompatible changes.
