PPO gives "Infinity" value for kl and total_loss

avnishn · October 1, 2021, 9:00pm

This is likely the correct answer. Typically with PPO you don't want to use a KL penalty. That is why the original PPO authors wrote both PPO1 and PPO2.
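In RLlib terms, that usually means setting the KL coefficient to zero. Here is a minimal sketch, assuming the `kl_coeff`, `clip_param`, and `entropy_coeff` keys of RLlib's PPO config. Please check them against your Ray version. The dict name `ppo_overrides` is made up for illustration and the values are not tuned:

```python
# Hypothetical override dict, to be merged into your own PPO config.
ppo_overrides = {
    "kl_coeff": 0.0,        # turn off the KL penalty term in the loss
    "clip_param": 0.2,      # rely on the clipped surrogate objective instead
    "entropy_coeff": 0.01,  # small entropy bonus (illustrative value)
}
```

With `kl_coeff` at zero the KL term should no longer contribute to `total_loss`, although the `kl` metric itself may still be reported.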

PPO2 uses max entropy rewards to achieve something similar to the KL penalty, but the entropy coefficient that controls their effect is much less brittle than the KL penalty coefficient.
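To make the idea concrete, here is a rough sketch of a clipped surrogate loss with an entropy bonus and no KL term. It is plain Python for discrete actions, not RLlib's or PPO2's actual code, and the function name `ppo_loss` and its arguments are my own assumptions:

```python
import math

def ppo_loss(logp_new, logp_old, advantages, probs_new,
             clip_param=0.2, entropy_coeff=0.01):
    """Clipped surrogate loss minus an entropy bonus (no KL penalty).

    logp_new / logp_old: log-probs of the taken actions under new / old policy
    advantages: advantage estimate per sample
    probs_new: full action distribution of the new policy per sample
    """
    n = len(advantages)
    surrogate = 0.0
    entropy = 0.0
    for lp_new, lp_old, adv, dist in zip(logp_new, logp_old, advantages, probs_new):
        ratio = math.exp(lp_new - lp_old)
        # Keep the probability ratio inside [1 - clip, 1 + clip].
        clipped = max(1.0 - clip_param, min(ratio, 1.0 + clip_param))
        surrogate += min(ratio * adv, clipped * adv)
        # Shannon entropy of the new policy; zero-probability actions are skipped.
        entropy += -sum(p * math.log(p) for p in dist if p > 0.0)
    # Minimising this maximises the surrogate and the entropy.
    return -(surrogate / n) - entropy_coeff * (entropy / n)
```

Because the ratio is clipped and the entropy of a discrete distribution is bounded, this loss stays finite as long as the log-probs and advantages are finite. A KL term has no such bound if the policies drift far apart.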

You can read more about max entropy RL here:
