Differences between the PPO implementation and the original PPO paper

sven1977 · May 11, 2021, 7:40am

Not sure why there’s this difference. We’d have to ask @ericl whether he remembers why we did this in our PPO implementation (I’m seeing this already in ray=0.8.0).
I don’t think we should change this right now. It may break people’s baselines and tuned configs.
The difference is also only marginal imho: both rules get the job done, automatically adjusting kl_coeff so that the sampled KL stays close to kl_target. RLlib’s rule first, then the paper’s:

        if sampled_kl > 2.0 * self.kl_target:
            self.kl_coeff *= 1.5
        elif sampled_kl < 0.5 * self.kl_target:
            self.kl_coeff *= 0.5

vs

        if sampled_kl > 1.5 * self.kl_target:
            self.kl_coeff *= 2.0
        elif sampled_kl < self.kl_target / 1.5:  # paper's threshold, roughly 0.667 * kl_target
            self.kl_coeff *= 0.5
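
To make the comparison concrete, here is a minimal sketch that applies both rules to the same inputs. The function names, the starting kl_coeff of 0.2, and the kl_target of 0.01 are illustrative assumptions, not RLlib's actual API; only the thresholds and multipliers come from the two snippets above.

```python
def update_kl_coeff_rllib(kl_coeff, sampled_kl, kl_target):
    """RLlib-style rule from the first snippet (hypothetical standalone helper)."""
    if sampled_kl > 2.0 * kl_target:
        kl_coeff *= 1.5
    elif sampled_kl < 0.5 * kl_target:
        kl_coeff *= 0.5
    return kl_coeff


def update_kl_coeff_paper(kl_coeff, sampled_kl, kl_target):
    """Paper-style rule from the second snippet (hypothetical standalone helper)."""
    if sampled_kl > 1.5 * kl_target:
        kl_coeff *= 2.0
    elif sampled_kl < kl_target / 1.5:
        kl_coeff *= 0.5
    return kl_coeff


if __name__ == "__main__":
    kl_target = 0.01  # assumed value, for illustration only
    kl_coeff = 0.2    # assumed starting coefficient, for illustration only
    for sampled_kl in (0.004, 0.006, 0.01, 0.017, 0.03):
        rllib = update_kl_coeff_rllib(kl_coeff, sampled_kl, kl_target)
        paper = update_kl_coeff_paper(kl_coeff, sampled_kl, kl_target)
        print(f"sampled_kl={sampled_kl}: rllib={rllib}, paper={paper}")
```

With these inputs the two rules only disagree when sampled_kl falls between their thresholds (for example, 0.017 leaves the RLlib-style coefficient unchanged while the paper-style rule doubles it), and when sampled_kl is high the RLlib-style rule grows the coefficient more gently (x1.5 instead of x2).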
