Breakdown of config and metrics of PPO implementation

I am posting this because it may not be obvious to everyone, and I also want to confirm that I have it right; I may be wrong or missing something. From the original PPO paper, we have this objective function that we want to maximize:
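For reference, this is the combined objective as written in the PPO paper (Eq. 9 there), where $L_t^{CLIP}$ is the clipped surrogate, $L_t^{VF}$ is the squared-error value loss, $S$ is the entropy bonus, $r_t(\theta)$ is the probability ratio and $\hat{A}_t$ is the advantage estimate:

$$
L_t^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t\left[ L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2 S[\pi_\theta](s_t) \right]
$$

$$
L_t^{CLIP}(\theta) = \min\left( r_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right)
$$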

Looking at the code, we can see that RLlib computes the following loss, which I suppose it minimizes by gradient descent or a similar method (I still haven't figured out which optimizer is used or whether we can change it; I'd be glad if anyone can answer this):
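Written out in the paper's notation, my reading of that loss is the following. This is a reconstruction under my own assumptions about the signs, not a quote from the code; $c_1$, $c_2$ and $\beta$ are the coefficients listed in the config correspondence further down, and $L_t^{VF}$ stands for whatever value-function loss RLlib computes:

$$
L^{\text{RLlib}}(\theta) = \hat{\mathbb{E}}_t\left[ -L_t^{CLIP}(\theta) + c_1 L_t^{VF}(\theta) - c_2 S[\pi_\theta](s_t) + \beta\, \mathrm{KL}\left[\pi_{\theta_{old}}(\cdot \mid s_t),\ \pi_\theta(\cdot \mid s_t)\right] \right]
$$

Apart from the extra KL term, this is the negative of the paper's objective, which is what you would expect when a maximization problem is handed to a minimizer.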

The metric with the key info/learner/default_policy/learner_stats/total_loss corresponds to the above expression. We also have:

info/learner/default_policy/learner_stats/policy_loss that corresponds to:

info/learner/default_policy/learner_stats/vf_loss that corresponds to:

info/learner/default_policy/learner_stats/entropy that corresponds to:

info/learner/default_policy/learner_stats/kl that corresponds to:
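To make the correspondence concrete, here is a minimal NumPy sketch of how I understand these four terms combine into total_loss. The function name, argument names and default values are hypothetical, it assumes policy_loss already carries the minus sign, and it leaves out details of the real code such as value-function clipping.

```python
import numpy as np


def ppo_loss_terms(logp_new, logp_old, advantages, value_preds, value_targets,
                   entropy, kl, clip_param=0.3, vf_loss_coeff=1.0,
                   entropy_coeff=0.0, kl_coeff=0.2):
    """Hypothetical helper: per-batch PPO loss terms, named after the metrics."""
    # Probability ratio r_t = pi_new(a|s) / pi_old(a|s), computed from log-probs.
    ratio = np.exp(logp_new - logp_old)
    # Clipped surrogate: elementwise min of the unclipped and clipped terms.
    surrogate = np.minimum(
        ratio * advantages,
        np.clip(ratio, 1.0 - clip_param, 1.0 + clip_param) * advantages,
    )
    policy_loss = -np.mean(surrogate)  # assumed sign: minimized, hence the minus
    vf_loss = np.mean((value_preds - value_targets) ** 2)
    mean_entropy = np.mean(entropy)
    mean_kl = np.mean(kl)
    total_loss = (policy_loss
                  + vf_loss_coeff * vf_loss
                  - entropy_coeff * mean_entropy
                  + kl_coeff * mean_kl)
    return {
        "total_loss": total_loss,
        "policy_loss": policy_loss,
        "vf_loss": vf_loss,
        "entropy": mean_entropy,
        "kl": mean_kl,
    }


# Toy batch of two samples with made-up numbers.
terms = ppo_loss_terms(
    logp_new=np.array([-1.0, -2.0]),
    logp_old=np.array([-1.0, -2.0]),
    advantages=np.array([1.0, 3.0]),
    value_preds=np.array([0.0, 0.0]),
    value_targets=np.array([1.0, 1.0]),
    entropy=np.array([0.5, 0.5]),
    kl=np.array([0.1, 0.1]),
    entropy_coeff=0.01,
)
print(terms)
```

Note that entropy and kl are reported as plain means, without their coefficients; the coefficients only enter through total_loss.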

From config we have the following correspondence:

config["vf_loss_coeff"] == c_1

config["entropy_coeff"] == c_2

config["kl_coeff"] == the initial value for beta
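As a sketch, this is how those three keys would appear in a PPO config dict. The numeric values are illustrative placeholders rather than recommendations, and the variable name is hypothetical.

```python
# Illustrative values only; the keys are the ones listed above.
ppo_config = {
    "vf_loss_coeff": 1.0,   # c_1: weight of the value-function loss
    "entropy_coeff": 0.01,  # c_2: weight of the entropy bonus
    "kl_coeff": 0.2,        # initial value of beta, the KL penalty weight
}

for key, value in ppo_config.items():
    print(f"{key} = {value}")
```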

The use of a KL penalty is also mentioned in the original paper. Usually an implementation uses either clipping or a KL penalty (PPO-Clip and PPO-Penalty are the terms used in the Spinning Up documentation).

RLlib is flexible and provides a hybrid version that can use a clipped surrogate loss and a KL-based regularization at the same time. Some may say that having both at once is redundant; in any case, we can turn one of them off with a 0 coefficient, for example setting kl_coeff to 0 removes the KL penalty.

The implementation of the beta coefficient update differs slightly from the original paper, as pointed out here
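To illustrate the kind of difference meant, the sketch below puts the paper's adaptive-KL rule next to the rule as I understand RLlib applies it. The paper's thresholds and factors come from its section on the adaptive KL penalty coefficient; the RLlib constants (2.0, 0.5, 1.5) come from my reading of the code and should be treated as an assumption, as should both function names.

```python
def paper_beta_update(beta, sampled_kl, kl_target):
    """Adaptive KL rule from the PPO paper: halve or double beta."""
    if sampled_kl < kl_target / 1.5:
        return beta / 2.0
    if sampled_kl > kl_target * 1.5:
        return beta * 2.0
    return beta


def rllib_style_beta_update(beta, sampled_kl, kl_target):
    """Assumed RLlib-style rule: wider dead zone, gentler increase."""
    if sampled_kl > 2.0 * kl_target:
        return beta * 1.5
    if sampled_kl < 0.5 * kl_target:
        return beta * 0.5
    return beta


# Same starting beta and target, three different measured KL values.
for kl in (0.001, 0.018, 0.05):
    print(kl,
          paper_beta_update(0.2, kl, 0.01),
          rllib_style_beta_update(0.2, kl, 0.01))
```

With a measured KL of 0.018 and a target of 0.01, the paper's rule already doubles beta while the second rule leaves it unchanged, so the two can drift apart over training even from the same kl_coeff.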

There is one metric, info/learner/default_policy/learner_stats/allreduce_latency, that I don't know the meaning of.

Since I chose to use a library rather than implement PPO myself, I often find myself asking what a given coefficient or metric really is. For example, the policy_loss metric could have been the surrogate loss without the minus sign, or entropy_coeff could have been expected to be a negative value. I think we could have better documentation for all configs and metrics. Sure, we can always dig into the code, but having it documented would be more convenient. I hope this post helps a bit; complementary comments and corrections are welcome.
