PPO gives "Infinity" value for kl and total_loss

Can anyone tell me why Tune wrote Infinity in the result.json file here? And why does Ray RLlib output Infinity for the total_loss and kl?

"info": {
        "learner": {
            "default_policy": {
                "learner_stats": {
                    "cur_kl_coeff": 1.27888392906113,
                    "total_loss": Infinity,
                    "policy_loss": -0.003319403744814432,
                    "vf_loss": 0.18914333338538805,
                    "kl": Infinity,
                    "entropy": 0.06346649648404688
                "model": {},
                "custom_metrics": {}
        "num_agent_steps_trained": 4118240

I have a vf_loss and a policy_loss, so why is the total_loss Infinity?

The total loss in RLlib's PPO includes the policy loss (surrogate loss), the value function loss, and the KL divergence penalty. It looks like the KL divergence between the new and old policy exploded to infinity, and an infinite KL term makes the whole sum infinite even though the other terms are finite.
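To make the arithmetic concrete, here is a minimal sketch of how such a sum behaves. It is not RLlib's actual code: the function name `ppo_total_loss` is made up for illustration, and the `vf_loss_coeff=1.0` and `entropy_coeff=0.0` values are assumptions, not taken from the posted results. Only the numeric stats are copied from the result.json above.

```python
import math

def ppo_total_loss(policy_loss, vf_loss, entropy, kl, kl_coeff,
                   vf_loss_coeff=1.0, entropy_coeff=0.0):
    """Illustrative PPO-style loss sum (hypothetical helper, not RLlib's API)."""
    total = policy_loss + vf_loss_coeff * vf_loss - entropy_coeff * entropy
    # Only add the KL penalty when its coefficient is positive.
    # Without this guard, 0.0 * inf would evaluate to nan.
    if kl_coeff > 0.0:
        total += kl_coeff * kl
    return total

# Values copied from the learner_stats in the question.
stats = {
    "policy_loss": -0.003319403744814432,
    "vf_loss": 0.18914333338538805,
    "entropy": 0.06346649648404688,
    "kl": math.inf,
}

with_kl = ppo_total_loss(**stats, kl_coeff=1.27888392906113)
without_kl = ppo_total_loss(**stats, kl_coeff=0.0)

print("with KL penalty:", with_kl)
print("without KL penalty:", without_kl)
```

With a positive KL coefficient the infinite `kl` dominates the sum; with the coefficient at zero the term is skipped and the result stays finite.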

Do you perhaps have a code reference that proves that the loss is calculated in such a way so that I can mark the answer as a solution?


This may be the cause of your issue.

@LukasNothhelfer one thing you can try is to set kl_coeff to 0.0; then the KL term will not be used in the loss.
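As a rough sketch, assuming you configure PPO through a plain config dict that you pass to Tune (the thread does not show your setup, so the other key here is only a placeholder), the change would look like this:

```python
# Hypothetical existing PPO config; the "lr" entry is a placeholder,
# not a value taken from this thread.
ppo_config = {
    "lr": 5e-5,
}

# Disable the KL penalty term, as suggested above.
ppo_config["kl_coeff"] = 0.0

print(ppo_config)
```

If you build your config some other way, the setting to look for is still `kl_coeff`.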

This is likely the correct answer. Typically with PPO you don’t want to use a KL penalty. That is why the original PPO authors authored PPO1 and PPO2.

PPO2 uses max entropy rewards to achieve something similar to the KL penalty, but the entropy coefficient that is used to control the effect of max entropy rewards is much less brittle than the KL penalty.

You can read more about max entropy RL here:
