@arturn This might be related to the issue here: not because of the algorithm itself, but because there might be NaNs in the model. I had a similar issue with my PPO.
@MrDracoG From what you mention, the NaNs in your weights might stem from very high losses/gradients. Did you observe any spikes in your losses?
The fact that you were able to mitigate the problem by decreasing lambda and lowering the clipping in the loss might point to very high advantages. Are you also training with very long episodes (and possibly "complete_episodes" in the batch_mode hyperparameter)?
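To illustrate why lambda and episode length matter here, below is a minimal sketch of the standard GAE recursion in plain Python. It is not RLlib's implementation; the function name `gae_advantages`, the constant TD errors, and the episode length of 1000 are assumptions chosen purely for illustration.

```python
def gae_advantages(deltas, gamma, lam):
    """Generalized Advantage Estimation over one episode.

    deltas: per-step TD errors, delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    Uses the backward recursion A_t = delta_t + gamma * lam * A_{t+1}.
    """
    advantages = [0.0] * len(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages


# Hypothetical long episode in which every step has the same TD error.
deltas = [1.0] * 1000

adv_high_lambda = gae_advantages(deltas, gamma=0.99, lam=1.0)
adv_low_lambda = gae_advantages(deltas, gamma=0.99, lam=0.9)

# Compare the advantage at the first step for both settings.
print(adv_high_lambda[0], adv_low_lambda[0])
```

With these made-up numbers, the first-step advantage is roughly ten times larger for lambda = 1.0 than for lambda = 0.9, because the TD errors of a long episode accumulate with a much slower decay. If the value function is poorly fitted, this is one way advantages (and with them the loss) can become very large.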
In regard to the squashed observation space: do you normalize your observations with RLlib’s MeanStdFilter?
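For reference, these are the settings I am asking about, written as they would appear in a dict-style PPO config. This is only a sketch: the key names assume RLlib's dict-based config, the variable name `ppo_config` is hypothetical, and all values are placeholders rather than recommendations.

```python
# Sketch of the relevant PPO settings (placeholder values, not tuned).
ppo_config = {
    # Collect whole episodes per rollout instead of fixed-length fragments.
    "batch_mode": "complete_episodes",
    # GAE lambda; lower values shorten the horizon over which TD errors accumulate.
    "lambda": 0.9,
    # PPO surrogate-objective clipping range.
    "clip_param": 0.1,
    # Gradient clipping threshold, one way to limit very large gradients.
    "grad_clip": 0.5,
    # Running mean/std normalization of observations.
    "observation_filter": "MeanStdFilter",
}
```

It would help to know which of these you have set and to what values.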