I am posting this because it may not be obvious to everyone, and I would also like confirmation that I have it right; I may be wrong or missing something. The original PPO paper gives this objective function, which we want to maximize:
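For reference, the combined objective from the PPO paper (its Eq. 9) can be written as below. Here r_t(θ) is the probability ratio between the new and old policy, Â_t the advantage estimate, ε the clip range, S the entropy bonus, and c_1, c_2 the coefficients discussed later in this post.

```latex
\[
L_t^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t\left[ L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2\, S[\pi_\theta](s_t) \right]
\]
\[
L_t^{CLIP}(\theta) = \min\left( r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right),
\qquad
L_t^{VF}(\theta) = \left( V_\theta(s_t) - V_t^{\mathrm{targ}} \right)^2
\]
```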
Looking at the code, we can see that RLlib calculates the following loss, which I suppose is minimized by gradient descent or a similar method (I still haven't figured out which optimizer is used or whether it can be changed; I would be glad if anyone could answer this):
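A reconstruction of that loss in the PPO paper's notation, assuming the structure of RLlib's PPO policy code as I understand it (details such as value-function clipping vary between versions, so treat this as a sketch rather than the exact implementation). β is the KL coefficient, and every sign is flipped relative to the paper because the expression is minimized:

```latex
\[
L^{\mathrm{RLlib}}(\theta) = \hat{\mathbb{E}}_t\left[ -L_t^{CLIP}(\theta) + c_1 L_t^{VF}(\theta) - c_2\, S[\pi_\theta](s_t) + \beta\, \mathrm{KL}\left[ \pi_{\theta_{\mathrm{old}}}(\cdot \mid s_t),\ \pi_\theta(\cdot \mid s_t) \right] \right]
\]
```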
The metric with the key info/learner/default_policy/learner_stats/total_loss corresponds to the above expression. Its individual terms are reported as well:
info/learner/default_policy/learner_stats/policy_loss corresponds to the clipped surrogate (policy) term,
info/learner/default_policy/learner_stats/vf_loss corresponds to the value-function loss term,
info/learner/default_policy/learner_stats/entropy corresponds to the policy entropy term, and
info/learner/default_policy/learner_stats/kl corresponds to the KL divergence between the old and the new policy.
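To make the mapping between the learner stats and the loss terms concrete, here is a minimal NumPy sketch. It is not RLlib code: the function name `ppo_loss_terms`, its arguments, and the default values are hypothetical, the value loss is a plain unclipped squared error, and the sign of `policy_loss` (negated surrogate) is an assumption that should be checked against the RLlib version in use.

```python
import numpy as np


def ppo_loss_terms(logp_new, logp_old, advantages, values, value_targets,
                   entropy, kl, clip_param=0.3, vf_loss_coeff=1.0,
                   entropy_coeff=0.0, kl_coeff=0.2):
    """Batch-averaged PPO loss terms, keyed like the learner stats (illustrative)."""
    logp_new, logp_old, advantages, values, value_targets, entropy, kl = (
        np.asarray(x, dtype=float)
        for x in (logp_new, logp_old, advantages, values, value_targets, entropy, kl)
    )
    ratio = np.exp(logp_new - logp_old)  # r_t(theta)
    clipped_ratio = np.clip(ratio, 1.0 - clip_param, 1.0 + clip_param)
    surrogate = np.minimum(ratio * advantages, clipped_ratio * advantages)

    policy_loss = -np.mean(surrogate)                  # assumed: negated surrogate
    vf_loss = np.mean((values - value_targets) ** 2)   # unclipped squared error
    mean_entropy = np.mean(entropy)
    mean_kl = np.mean(kl)

    # Minimized quantity: signs are flipped relative to the paper's objective.
    total_loss = (policy_loss
                  + vf_loss_coeff * vf_loss
                  - entropy_coeff * mean_entropy
                  + kl_coeff * mean_kl)
    return {
        "total_loss": float(total_loss),
        "policy_loss": float(policy_loss),
        "vf_loss": float(vf_loss),
        "entropy": float(mean_entropy),
        "kl": float(mean_kl),
    }
```

In this sketch `entropy` and `kl` are reported without their coefficients, so `total_loss` is not simply the sum of the other four values.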
From config we have the following correspondence:
config["vf_loss_coeff"] == c_1
config["entropy_coeff"] == c_2
config["kl_coeff"] == the initial value for beta
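As a small illustration of that correspondence, the three keys can be set in a plain config dict. The numeric values below are placeholders chosen for the example, not recommendations, and `symbols` is a hypothetical helper mapping used only to spell out which key plays which role.

```python
# Illustrative PPO config overrides; the values are placeholders.
ppo_config = {
    "vf_loss_coeff": 1.0,   # c_1: weight of the value-function loss
    "entropy_coeff": 0.01,  # c_2: weight of the entropy bonus
    "kl_coeff": 0.2,        # initial beta: weight of the KL penalty
}

# Hypothetical mapping from the paper's symbols to the config values.
symbols = {
    "c_1": ppo_config["vf_loss_coeff"],
    "c_2": ppo_config["entropy_coeff"],
    "beta_initial": ppo_config["kl_coeff"],
}
```

A dict like this would be merged into the config passed to the PPO trainer; how exactly depends on the RLlib version.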
The use of a KL penalty is also mentioned in the original paper. Implementations usually use either clipping or a KL penalty (PPO-Clip and PPO-Penalty are the terms used in the Spinning Up documentation).
RLlib is flexible and provides a hybrid version that can use a clipped surrogate loss and KL-based regularization at the same time. Some may say that having both is redundant; in any case, we can turn off one of these by setting its coefficient to 0.
There is a minor difference between RLlib's update of the beta coefficient and the one in the original paper, as pointed out here
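A sketch of the two update rules side by side. The first function follows the adaptive KL penalty rule from the PPO paper; the second reflects the thresholds and factors as I understand them from RLlib's code, which may differ between versions, so treat those constants as an assumption. Both function names are hypothetical.

```python
def paper_beta_update(beta, kl, kl_target):
    """Adaptive KL coefficient update as described in the PPO paper."""
    if kl < kl_target / 1.5:
        return beta / 2.0
    if kl > kl_target * 1.5:
        return beta * 2.0
    return beta


def rllib_style_beta_update(beta, kl, kl_target):
    """RLlib-style update (assumed constants; check your RLlib version)."""
    if kl > 2.0 * kl_target:
        return beta * 1.5
    if kl < 0.5 * kl_target:
        return beta * 0.5
    return beta
```

With `kl_target = 0.01` and a measured KL of 0.018, for example, the paper's rule doubles beta while the RLlib-style rule leaves it unchanged, because its upper threshold is wider.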
There is one metric, info/learner/default_policy/learner_stats/allreduce_latency, whose meaning I don't know.
Since I chose to use a library, I often find myself asking what a particular coefficient or metric actually is, because I haven't implemented it myself. For example, the policy_loss metric could be the surrogate loss without the minus sign, or entropy_coeff could be expected to be a negative value. I think the documentation of all configs and metrics could be better. We can always dig into the code and look things up, but having it documented would be more convenient. I hope this post helps a bit; complementary comments and corrections are welcome.