I am posting this because it may not be obvious to everyone, and I also want to confirm that I have it right; I may be wrong or missing something. From the original PPO paper, we have this objective function that we want to maximize:
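For reference, this is the combined objective as I remember it from the PPO paper (Schulman et al., 2017), written out in LaTeX. The symbols follow the paper's notation: $r_t(\theta)$ is the probability ratio, $\hat{A}_t$ the advantage estimate, $\epsilon$ the clip range, and $S$ the entropy bonus.

$$
L_t^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t\left[ L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2 S[\pi_\theta](s_t) \right]
$$

$$
L_t^{CLIP}(\theta) = \min\left( r_t(\theta)\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_t \right), \qquad L_t^{VF}(\theta) = \left( V_\theta(s_t) - V_t^{targ} \right)^2
$$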

Looking at the code, we can see that RLlib computes the following loss, which I suppose it **minimizes** by gradient descent or a similar method (I still haven't figured out which optimizer is used or whether we can change it; I would be glad if anyone could answer this):
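This is how I read that loss, written in the paper's notation. It is my own transcription and not copied from the RLlib source, so treat the exact form as an assumption: the paper's objective with the sign flipped (so that it can be minimized), plus a KL penalty weighted by $\beta$.

$$
L^{RLlib}(\theta) = \hat{\mathbb{E}}_t\left[ -L_t^{CLIP}(\theta) + c_1 L_t^{VF}(\theta) - c_2 S[\pi_\theta](s_t) + \beta\, \mathrm{KL}\left[ \pi_{\theta_{old}}(\cdot \mid s_t),\ \pi_\theta(\cdot \mid s_t) \right] \right]
$$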

The metric with the key `info/learner/default_policy/learner_stats/total_loss`

corresponds to the above expression. We also have:

`info/learner/default_policy/learner_stats/policy_loss`

that corresponds to:

`info/learner/default_policy/learner_stats/vf_loss`

that corresponds to:

`info/learner/default_policy/learner_stats/entropy`

that corresponds to:

`info/learner/default_policy/learner_stats/kl`

that corresponds to:
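To make the correspondence between the metrics concrete, here is a minimal NumPy sketch of how I understand the terms to combine. It is not RLlib's code: the function name `ppo_loss_terms` and its arguments are hypothetical, the value-function clipping that RLlib can apply is left out, and the sign conventions (in particular `policy_loss` being the negated surrogate) are my assumption.

```python
import numpy as np

def ppo_loss_terms(logp_new, logp_old, advantages, values, value_targets,
                   entropy, kl, clip_param=0.3, vf_loss_coeff=1.0,
                   entropy_coeff=0.0, kl_coeff=0.2):
    """Hypothetical helper: combine per-sample PPO terms into scalar metrics."""
    # Probability ratio r_t(theta) = pi_theta(a|s) / pi_theta_old(a|s).
    ratio = np.exp(logp_new - logp_old)
    # Clipped surrogate objective, per sample.
    surrogate = np.minimum(
        ratio * advantages,
        np.clip(ratio, 1.0 - clip_param, 1.0 + clip_param) * advantages,
    )
    # Assumed sign convention: the loss is the negated surrogate.
    policy_loss = -np.mean(surrogate)
    # Squared error of the value function (no value clipping in this sketch).
    vf_loss = np.mean((values - value_targets) ** 2)
    mean_entropy = np.mean(entropy)
    mean_kl = np.mean(kl)
    total_loss = (policy_loss
                  + vf_loss_coeff * vf_loss
                  - entropy_coeff * mean_entropy
                  + kl_coeff * mean_kl)
    return {
        "total_loss": total_loss,
        "policy_loss": policy_loss,
        "vf_loss": vf_loss,
        "entropy": mean_entropy,
        "kl": mean_kl,
    }
```

Under this reading, `total_loss` is not the plain sum of the other four metrics: `vf_loss`, `entropy` and `kl` are reported unweighted, and the coefficients are only applied when they are combined.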

From config we have the following correspondence:

`config["vf_loss_coeff"]`

== c_1

`config["entropy_coeff"]`

== c_2

`config["kl_coeff"]`

== the initial value for beta
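As a quick reference, the mapping can be written as a plain config dict. The numeric values below are only illustrative and may not match the defaults of your RLlib version. `clip_param` and `kl_target` are two further PPO config keys that I am adding here on the assumption that they play the roles of $\epsilon$ and $d_{targ}$ from the paper.

```python
# Illustrative PPO settings; the values are placeholders, not verified defaults.
ppo_config = {
    "vf_loss_coeff": 1.0,   # c_1: weight of the value-function loss
    "entropy_coeff": 0.0,   # c_2: weight of the entropy bonus
    "kl_coeff": 0.2,        # initial value of beta (KL penalty weight)
    "clip_param": 0.3,      # assumed to be epsilon, the surrogate clip range
    "kl_target": 0.01,      # assumed to be d_targ, the target KL for adapting beta
}

# Variant with the KL penalty switched off, leaving only the clipped surrogate.
clip_only_config = {**ppo_config, "kl_coeff": 0.0}
```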

The use of a KL penalty is also mentioned in the original paper. Usually an implementation uses either clipping or a KL penalty (PPO-Clip and PPO-Penalty are the terms used in the Spinning Up documentation).

RLlib is flexible and provides a hybrid version that can use a clipped surrogate loss and a KL-based regularization at the same time. Some may say that having both at once is redundant; in any case, the KL term can be turned off by setting its coefficient to 0.

The update of the beta coefficient is implemented slightly differently from the original paper, as pointed out here
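To illustrate the kind of difference I mean, here is a small sketch of the two adaptive-beta rules as I understand them. The paper's rule is the one from its adaptive KL penalty section; the RLlib-style rule is written from my memory of the source, so its thresholds and factors are an assumption and may differ between versions. Both function names are hypothetical.

```python
def paper_beta_update(beta, sampled_kl, kl_target):
    """Adaptive KL coefficient as described in the PPO paper."""
    if sampled_kl < kl_target / 1.5:
        return beta / 2.0
    if sampled_kl > kl_target * 1.5:
        return beta * 2.0
    return beta

def rllib_style_beta_update(beta, sampled_kl, kl_target):
    """RLlib-style rule as I recall it (assumed thresholds and factors)."""
    if sampled_kl > 2.0 * kl_target:
        return beta * 1.5
    if sampled_kl < 0.5 * kl_target:
        return beta * 0.5
    return beta
```

With `kl_target = 0.01` and a sampled KL of `0.03`, the first rule doubles beta while the second multiplies it by 1.5; when the sampled KL is equal to the target, both leave beta unchanged.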

There is one metric `info/learner/default_policy/learner_stats/allreduce_latency`

that I don't know what it refers to.

Since I chose to use a library, I often find myself asking what a coefficient or a particular metric actually is, because I haven't implemented it myself. For example, the `policy_loss`

metric could be the surrogate loss without the minus sign, or the `entropy_coeff`

could be expected to be a negative value. I think we could have better documentation of all configs and metrics. Of course we can always dig into the code and look it up, but it would be more convenient to have it documented. I hope this post helps a bit; complementary comments and corrections are welcome.