I think there is an error in the PPO clipped surrogate loss and would appreciate it if someone else could take a look and let me know if I am missing something. This affects both the tf and torch versions.
If you look at the code snippet above from master, you will see that line 139 computes the mean of the negative surrogate loss. But on line 175, when combining the surrogate, vf, and entropy losses into total_loss, the surrogate loss is negated again, which I think is incorrect.
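To make the concern concrete, here is roughly the pattern in question, paraphrased from memory rather than copied verbatim (the stand-in tensors, coefficient names, and the plain-mean version of `reduce_mean_valid` are my approximations, not the actual file contents):

```python
import torch

# Stand-ins for the quantities in the snippet (shapes/values are arbitrary).
surrogate_loss = torch.randn(8)   # per-sample clipped PPO objective
vf_loss = torch.randn(8)
curr_entropy = torch.randn(8)
vf_loss_coeff, entropy_coeff = 0.5, 0.01
reduce_mean_valid = torch.mean    # a masked mean in the real code, plain mean here

# ~line 139: first negation, reduced to a scalar.
mean_policy_loss = reduce_mean_valid(-surrogate_loss)

# ~line 175: surrogate_loss appears negated again when building total_loss.
# My worry was that this amounted to a double negation of the same quantity.
total_loss = reduce_mean_valid(
    -surrogate_loss
    + vf_loss_coeff * vf_loss
    - entropy_coeff * curr_entropy
)
```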
I checked the cleanrl implementation and observed that they compute it as I expected, with only a single negation.
As does stable baselines:
And, although it is harder to trace, so does seedrl (I think):
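For reference, here is a minimal sketch of the single-negation pattern I expected, in the style of those implementations (the function and variable names are mine, not copied from any of the repos):

```python
import torch

def clipped_surrogate_loss(ratio, advantages, clip_eps=0.2):
    """PPO clipped policy loss with a single negation (cleanrl-style sketch)."""
    # Per-sample clipped objective; PPO maximizes this quantity.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Negate exactly once to turn the objective into a loss to minimize.
    return -torch.min(unclipped, clipped).mean()
```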
@Lars_Simon_Zehnder pointed out that mean_policy_loss is used for metrics and surrogate_loss is used for the loss.
So I think it is OK.
What threw me off is that most libraries I looked at take the mean of each loss component separately and then combine them, whereas rllib combines the per-sample terms first and then takes a single mean. So I was expecting mean_policy_loss to be the one used in total_loss, and I did not read carefully enough.
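For what it is worth, the two styles give the same result as long as every component is averaged over the same elements with the same weights, since the mean is linear. A quick sanity check (hypothetical tensors, coefficients chosen arbitrarily):

```python
import torch

torch.manual_seed(0)
pi_loss = torch.randn(8)   # per-sample negated surrogate terms
vf_loss = torch.randn(8)   # per-sample value-function errors
entropy = torch.randn(8)   # per-sample entropies
vf_coeff, ent_coeff = 0.5, 0.01

# Style A: mean each component, then combine (cleanrl / stable baselines style).
a = pi_loss.mean() + vf_coeff * vf_loss.mean() - ent_coeff * entropy.mean()

# Style B: combine per-sample terms, then take one mean (rllib style).
b = (pi_loss + vf_coeff * vf_loss - ent_coeff * entropy).mean()

assert torch.allclose(a, b)  # linearity of the mean: both give the same loss
```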
Why is the mean re-computed in total_loss, though? Perhaps it is required for multi-GPU cases?