I'm not sure why there's this difference. We'd have to ask @ericl whether he remembers why we did it this way in our PPO implementation (I'm seeing this already in ray=0.8.0).
I don’t think we should change this right now. It may break people’s baselines and tuned configs.
The difference is also only marginal imho (both get the job done: they automatically adjust kl_coeff so that the sampled KL stays close to kl_target):
if sampled_kl > 2.0 * self.kl_target:
    self.kl_coeff *= 1.5
elif sampled_kl < 0.5 * self.kl_target:
    self.kl_coeff *= 0.5
vs
if sampled_kl > 1.5 * self.kl_target:
    self.kl_coeff *= 2.0
elif sampled_kl < 0.66666 * self.kl_target:  # (paper says: "sampled_kl < self.kl_target / 1.5")
    self.kl_coeff *= 0.5
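
To make the comparison concrete, here is a minimal standalone sketch of the two rules as pure functions. The function names, the starting kl_coeff of 0.2, the kl_target of 0.01, and the sampled KL values are all made up for illustration; this is not RLlib's actual code path, just the two snippets above lifted out of the class.

```python
def update_kl_coeff_rllib(kl_coeff, sampled_kl, kl_target):
    """First rule above: wider dead band (0.5x to 2.0x), gentler increase (x1.5)."""
    if sampled_kl > 2.0 * kl_target:
        return kl_coeff * 1.5
    elif sampled_kl < 0.5 * kl_target:
        return kl_coeff * 0.5
    return kl_coeff


def update_kl_coeff_paper(kl_coeff, sampled_kl, kl_target):
    """Second rule above: narrower dead band (1/1.5x to 1.5x), increase by x2."""
    if sampled_kl > 1.5 * kl_target:
        return kl_coeff * 2.0
    elif sampled_kl < kl_target / 1.5:
        return kl_coeff * 0.5
    return kl_coeff


if __name__ == "__main__":
    kl_target = 0.01  # hypothetical target
    kl_coeff = 0.2    # hypothetical starting coefficient
    # Hypothetical sampled KL values, chosen to land on both sides of each threshold.
    for sampled_kl in (0.025, 0.017, 0.010, 0.006, 0.004):
        a = update_kl_coeff_rllib(kl_coeff, sampled_kl, kl_target)
        b = update_kl_coeff_paper(kl_coeff, sampled_kl, kl_target)
        print(f"sampled_kl={sampled_kl}: first rule -> {a:.3f}, second rule -> {b:.3f}")
```

The two rules only disagree when the sampled KL falls between the thresholds (here 0.017 and 0.006, where the first rule leaves kl_coeff alone and the second one changes it) or when the KL is too high (x1.5 vs x2.0). Both push kl_coeff in the same direction, which is why I'd call the difference marginal.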