PPO gives "Infinity" value for kl and total_loss

This is likely the correct answer. Typically with PPO you don’t want to use a KL penalty, which is why the original PPO authors wrote both PPO1 and PPO2.
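
To illustrate why a KL term can legitimately come out as infinity, here is a minimal sketch in plain Python. It assumes discrete action distributions, and the function name `kl_divergence` is made up for this example rather than taken from any library. The divergence becomes infinite as soon as the new policy assigns zero probability to an action the old policy could still take, and any loss that adds a KL penalty inherits that value.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # 0 * log(0 / q) is taken as 0 by convention
        if q_i == 0.0:
            return math.inf  # p puts mass where q has none
        total += p_i * math.log(p_i / q_i)
    return total

old_policy = [0.5, 0.5]
print(kl_divergence(old_policy, [0.6, 0.4]))  # small finite value
print(kl_divergence(old_policy, [1.0, 0.0]))  # inf
```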

PPO2 uses max entropy rewards to achieve something similar to the KL penalty, but the entropy coefficient that controls their effect is much less brittle than the KL penalty coefficient.
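
As a rough sketch of what an entropy coefficient does, the plain-Python example below combines a clipped surrogate with an entropy term. It assumes discrete actions and assumes the entropy enters as a bonus on the loss, which is how PPO implementations commonly expose an entropy coefficient. The names `clipped_loss_with_entropy`, `clip_eps` and `entropy_coeff` are hypothetical, and the default values are illustrative rather than tuned.

```python
import math

def entropy(probs):
    """Shannon entropy of a discrete distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def clipped_loss_with_entropy(ratio, advantage, action_probs,
                              clip_eps=0.2, entropy_coeff=0.01):
    """Per-sample PPO-style loss: clipped surrogate minus an entropy bonus.

    ratio        : pi_new(a|s) / pi_old(a|s) for the sampled action
    advantage    : advantage estimate for that action
    action_probs : pi_new(.|s), used only for the entropy term
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    surrogate = min(ratio * advantage, clipped_ratio * advantage)
    # Minimizing this loss maximizes both the surrogate and the entropy.
    return -surrogate - entropy_coeff * entropy(action_probs)

# Even a fully deterministic new policy gives a finite loss here,
# because the entropy is bounded and the ratio is clipped.
print(clipped_loss_with_entropy(ratio=3.0, advantage=1.0, action_probs=[1.0, 0.0]))
```

Unlike a KL term, the entropy of a discrete distribution is bounded, so scaling it by a coefficient cannot push the loss to infinity.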

You can read more about max entropy RL here: