Can anyone tell me why Tune wrote `Infinity` in the `result.json` file here? And why does Ray RLlib output `Infinity` for the `total_loss` and `kl`?
"info": {
"learner": {
"default_policy": {
"learner_stats": {
"cur_kl_coeff": 1.27888392906113,
"total_loss": Infinity,
"policy_loss": -0.003319403744814432,
"vf_loss": 0.18914333338538805,
"kl": Infinity,
"entropy": 0.06346649648404688
},
"model": {},
"custom_metrics": {}
}
},
"num_agent_steps_trained": 4118240
}
I have a `vf_loss` and a `policy_loss`, so why is the `total_loss` `Infinity`?
Total loss in RLlib’s PPO includes the policy loss (surrogate loss), the value function loss, and the KL divergence penalty. It looks like the KL divergence between the new and old policy exploded to infinity.
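Schematically, the terms are combined as in the sketch below. This is a simplified illustration, not RLlib’s actual code (see `ppo_torch_policy.py` in the Ray repo for the real implementation); the coefficient names mirror RLlib’s PPO config keys, and the dummy numbers are taken from the `learner_stats` above.

```python
import torch

def ppo_total_loss(surrogate_loss, kl, vf_loss, entropy,
                   cur_kl_coeff, vf_loss_coeff=1.0, entropy_coeff=0.0):
    # Maximize the surrogate objective and the entropy bonus, minimize the
    # value-function error, and penalize KL drift from the old policy.
    return (-surrogate_loss
            + cur_kl_coeff * kl
            + vf_loss_coeff * vf_loss
            - entropy_coeff * entropy)

# With finite policy_loss and vf_loss but kl == inf, the sum is inf:
loss = ppo_total_loss(
    surrogate_loss=torch.tensor(0.0033),  # ~ -policy_loss from the stats
    kl=torch.tensor(float("inf")),
    vf_loss=torch.tensor(0.1891),
    entropy=torch.tensor(0.0635),
    cur_kl_coeff=1.2789,
)
print(loss)  # tensor(inf)
```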
Do you perhaps have a code reference that shows the loss is calculated this way, so that I can mark the answer as a solution?
mannyv
October 1, 2021, 10:55am
4
@LukasNothhelfer ,
This may be the cause of your issue.
(GitHub issue, opened 10 Sep 2021, labels: bug, triage)
### What is the problem?
Recently, I found that if we do not finetune the `kl_target` parameter in `ppo_torch_policy.py`, and the sampled KL is always larger than `2 * kl_target`, then `kl_coeff` is multiplied by `1.5` on every update. It grows to **infinity** if `sampled_kl` stays large for some tasks, so the total loss for PPO also becomes infinity.
https://github.com/ray-project/ray/blob/d314d0c10eb7677ff6638d94118a2caeba9af419/rllib/agents/ppo/ppo_torch_policy.py#L187-L194


Though it is the user's duty to set proper parameters, maybe we should add a warning about it, or cap the maximum `kl_coeff` value, to help the user debug more easily?
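In other words, the adaptive-KL update described in the issue behaves roughly like the sketch below (simplified; the starting value and target here are assumptions based on RLlib’s defaults, so check your own config):

```python
def update_kl_coeff(kl_coeff, sampled_kl, kl_target):
    # Grow the penalty when the sampled KL overshoots the target,
    # shrink it when the sampled KL undershoots it.
    if sampled_kl > 2.0 * kl_target:
        kl_coeff *= 1.5
    elif sampled_kl < 0.5 * kl_target:
        kl_coeff *= 0.5
    return kl_coeff

# If sampled_kl stays above 2 * kl_target on every update, kl_coeff grows
# geometrically (0.2 * 1.5**n) until it overflows to inf, which then makes
# the kl_coeff * kl term, and hence total_loss, infinite as well.
kl_coeff = 0.2   # assumed starting value
for _ in range(2000):
    kl_coeff = update_kl_coeff(kl_coeff, sampled_kl=1.0, kl_target=0.01)
print(kl_coeff)  # inf
```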
mannyv
October 1, 2021, 2:59pm
5
@LukasNothhelfer one thing you can try is setting `kl_coeff` to 0.0; then the KL term will not be used in the loss.
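For example, something along these lines (a minimal sketch using the `tune.run` API of that Ray release; the environment and stopping criterion are placeholders):

```python
from ray import tune

tune.run(
    "PPO",
    config={
        "env": "CartPole-v1",   # placeholder environment
        "framework": "torch",
        "kl_coeff": 0.0,        # drop the KL penalty term from PPO's total loss
        # or, to keep the penalty but make it less aggressive:
        # "kl_target": 0.03,
    },
    stop={"training_iteration": 10},
)
```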
This is likely the correct answer. Typically with PPO you don’t want to use a KL penalty; that is why the original PPO authors wrote both PPO1 and PPO2.
PPO2 uses max-entropy rewards to achieve something similar to the KL penalty, but the entropy coefficient used to control the effect of the max-entropy rewards is much less brittle than the KL penalty.
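If you go that route in RLlib, the relevant knobs would look roughly like this (the values are illustrative, not recommendations):

```python
config = {
    "kl_coeff": 0.0,        # disable the KL penalty entirely
    "entropy_coeff": 0.01,  # rely on an entropy bonus for regularization instead
    # optionally anneal the bonus over timesteps:
    # "entropy_coeff_schedule": [[0, 0.01], [1_000_000, 0.0]],
}
```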
You can read more about max-entropy RL here: