PPO nan in actor logits

Hi @tlaurie99,

Welcome to the forum.

If I had to venture a guess as to where the NaNs originate, it would be here:

self.dist = torch.distributions.normal.Normal(mean, torch.exp(log_std))

It has been my experience that, with a continuous action space, the log_std outputs that parameterize the action distribution can become very negative at some point during training. That gives a std close to zero, which produces a NaN when the backward pass of the normal distribution's log-prob divides by ~zero.
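For illustration, here is a minimal, self-contained sketch of that failure mode (the -100 value and the 0.5 "action" are placeholders, not numbers from your model; the exact point at which things become non-finite depends on dtype):

```python
import torch

# A very negative log_std makes exp(log_std) underflow toward zero,
# so the Normal log-prob and its gradients blow up to inf/NaN.
mean = torch.zeros(1, requires_grad=True)
log_std = torch.full((1,), -100.0, requires_grad=True)

dist = torch.distributions.normal.Normal(mean, torch.exp(log_std))
log_prob = dist.log_prob(torch.tensor([0.5]))  # stand-in for a sampled action
log_prob.sum().backward()

print(log_prob)      # -inf
print(mean.grad)     # non-finite (typically inf)
print(log_std.grad)  # non-finite (typically NaN)
```

Once gradients like these reach the optimizer, the weights and then the actor logits go NaN, which matches what you are seeing.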

RLlib is unusual among the popular frameworks in that it uses the policy network itself to generate the log_std values.

If you look at CleanRL's or SB3's implementations, you will see that they register log_std as a standalone parameter of the model, so it can still be learned, but it is not an output of the network layers.

Since you are already using a custom model, you might try implementing this alternative to see if it helps; a rough sketch is below.
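Something along these lines, written as a plain PyTorch module rather than the exact RLlib custom-model API (class and argument names are placeholders; the clamp is an extra safeguard I like to add, not something CleanRL/SB3 strictly require):

```python
import torch
import torch.nn as nn

class GaussianActor(nn.Module):
    """Sketch of the CleanRL/SB3-style parameterization: the network only
    produces the action mean, while log_std is a free nn.Parameter that is
    learned by the optimizer but never output by a layer."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )
        # State-independent log_std, initialized to 0 (std = 1).
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs: torch.Tensor) -> torch.distributions.Normal:
        mean = self.mean_net(obs)
        # Optional safety net: bound log_std so exp() can never underflow to ~0.
        log_std = self.log_std.clamp(-20.0, 2.0)
        return torch.distributions.normal.Normal(mean, torch.exp(log_std))
```

Because log_std no longer depends on the observation, a handful of bad states can't push it to extreme values, and the clamp keeps the std strictly positive.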

cleanrl:

sb3:
