Can anyone tell me why Tune wrote `Infinity` in the `result.json` file here? And why does Ray RLlib output `Infinity` for the `total_loss` and `kl`?
"info": {
"learner": {
"default_policy": {
"learner_stats": {
"cur_kl_coeff": 1.27888392906113,
"total_loss": Infinity,
"policy_loss": -0.003319403744814432,
"vf_loss": 0.18914333338538805,
"kl": Infinity,
"entropy": 0.06346649648404688
},
"model": {},
"custom_metrics": {}
}
},
"num_agent_steps_trained": 4118240
}
I have a `vf_loss` and a `policy_loss`, so why is the `total_loss` `Infinity`?
Total loss in RLlib’s PPO includes the policy loss (surrogate loss), the value function loss, and the KL divergence penalty. It looks like the KL divergence between the new and old policy exploded to infinity.
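Schematically, the terms are combined as in the sketch below. This is a simplified illustration, not RLlib’s actual code (see `ppo_torch_policy.py` in the Ray repo for the real implementation); the coefficient names mirror RLlib’s PPO config keys, and the dummy numbers are taken from the `learner_stats` above.

```python
import torch

def ppo_total_loss(surrogate_loss, kl, vf_loss, entropy,
                   cur_kl_coeff, vf_loss_coeff=1.0, entropy_coeff=0.0):
    # Maximize the surrogate objective and the entropy bonus, minimize the
    # value-function error, and penalize KL drift from the old policy.
    return (-surrogate_loss
            + cur_kl_coeff * kl
            + vf_loss_coeff * vf_loss
            - entropy_coeff * entropy)

# With finite policy_loss and vf_loss but kl == inf, the sum is inf:
loss = ppo_total_loss(
    surrogate_loss=torch.tensor(0.0033),  # ~ -policy_loss from the stats
    kl=torch.tensor(float("inf")),
    vf_loss=torch.tensor(0.1891),
    entropy=torch.tensor(0.0635),
    cur_kl_coeff=1.2789,
)
print(loss)  # tensor(inf)
```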
Do you perhaps have a code reference that shows the loss is calculated this way, so that I can mark the answer as a solution?
mannyv
October 1, 2021, 10:55am
4
@LukasNothhelfer ,
This may be the cause of your issue.
(GitHub issue, opened 10 Sep 2021, labels: bug, triage)
### What is the problem?
Recently, I found that if we do not finetune the `kl_target` parameter in `ppo_torch_policy.py`, and the sampled KL is always larger than `2 * kl_target`, then `kl_coeff` is multiplied by `1.5` on every update. It grows to **infinity** if `sampled_kl` stays large for some tasks, so the total loss for PPO also becomes infinity.
https://github.com/ray-project/ray/blob/d314d0c10eb7677ff6638d94118a2caeba9af419/rllib/agents/ppo/ppo_torch_policy.py#L187-L194


Though it is the user's duty to set proper parameters, maybe we should add a warning about it, or cap the maximum `kl_coeff` value, to help the user debug more easily?
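In other words, the adaptive-KL update described in the issue behaves roughly like the sketch below (simplified; the starting value and target here are assumptions based on RLlib’s defaults, so check your own config):

```python
def update_kl_coeff(kl_coeff, sampled_kl, kl_target):
    # Grow the penalty when the sampled KL overshoots the target,
    # shrink it when the sampled KL undershoots it.
    if sampled_kl > 2.0 * kl_target:
        kl_coeff *= 1.5
    elif sampled_kl < 0.5 * kl_target:
        kl_coeff *= 0.5
    return kl_coeff

# If sampled_kl stays above 2 * kl_target on every update, kl_coeff grows
# geometrically (0.2 * 1.5**n) until it overflows to inf, which then makes
# the kl_coeff * kl term, and hence total_loss, infinite as well.
kl_coeff = 0.2   # assumed starting value
for _ in range(2000):
    kl_coeff = update_kl_coeff(kl_coeff, sampled_kl=1.0, kl_target=0.01)
print(kl_coeff)  # inf
```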
mannyv
October 1, 2021, 2:59pm
5
@LukasNothhelfer one thing you can try is setting `kl_coeff` to 0.0; then the KL term will not be used in the loss.
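For example, something along these lines (a minimal sketch using the `tune.run` API of that Ray release; the environment and stopping criterion are placeholders):

```python
from ray import tune

tune.run(
    "PPO",
    config={
        "env": "CartPole-v1",   # placeholder environment
        "framework": "torch",
        "kl_coeff": 0.0,        # drop the KL penalty term from PPO's total loss
        # or, to keep the penalty but make it less aggressive:
        # "kl_target": 0.03,
    },
    stop={"training_iteration": 10},
)
```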
This is likely the correct answer. Typically with PPO you don’t want to use a KL penalty; that is why the original PPO authors wrote both PPO1 and PPO2.
PPO2 uses max-entropy rewards to achieve something similar to the KL penalty, but the entropy coefficient used to control the effect of the max-entropy rewards is much less brittle than the KL penalty.
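If you go that route in RLlib, the relevant knobs would look roughly like this (the values are illustrative, not recommendations):

```python
config = {
    "kl_coeff": 0.0,        # disable the KL penalty entirely
    "entropy_coeff": 0.01,  # rely on an entropy bonus for regularization instead
    # optionally anneal the bonus over timesteps:
    # "entropy_coeff_schedule": [[0, 0.01], [1_000_000, 0.0]],
}
```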
You can read more about max-entropy RL here: