DDPPO on CPU vs GPU: NaN values during training

I am training an agent with DDPPO, which accelerates SGD across multiple CPUs and GPUs.
AWS instance: p2.xlarge
AMI: Ubuntu 18 Deep Learning AMI with PyTorch, Python 3.7
Training the agent without GPUs succeeds, but training with workers on the GPU produces the following error:


 File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/distributions/distribution.py", line 53, in __init__
(pid=6229)     raise ValueError("The parameter {} has invalid values".format(param))
(pid=6229) ValueError: The parameter logits has invalid values

I observe NaN values in the forward pass while generating the action and value outputs.
I have tried adjusting the learning rate, but that doesn't seem to help.
Looking for a helping hand to debug this!
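For reference, this is roughly how I'm detecting where the NaNs first appear. The `check_finite` helper and the tensors below are illustrative placeholders, not my actual model code:

```python
import torch

def check_finite(name, tensor):
    # Illustrative helper: raise as soon as a tensor goes non-finite,
    # so the failing component is named instead of surfacing later
    # inside torch.distributions validation.
    if not torch.isfinite(tensor).all():
        raise RuntimeError(f"{name} contains NaN/Inf values")
    return tensor

# A logits tensor containing a NaN trips the check before it would
# reach Categorical(logits=...) and its "invalid values" error.
logits = torch.tensor([0.1, float("nan"), 0.3])
try:
    check_finite("logits", logits)
except RuntimeError as e:
    print("caught:", e)
```

`torch.autograd.set_detect_anomaly(True)` can also help locate the op that first produces a NaN in the backward pass, at some runtime cost.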

Happy to help debug! Could you visualize the loss values on TensorBoard (just run tensorboard --logdir ~/ray_results/), so we can figure out which component went wrong? It would also be nice to see the gradient norm.
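If you want to log the gradient norm yourself, something along these lines works in plain PyTorch. The toy linear model here is just a stand-in for your policy network:

```python
import torch
import torch.nn as nn

# Placeholder model and loss standing in for the policy network.
model = nn.Linear(4, 2)
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()

# Global L2 gradient norm, the quantity worth logging alongside the loss:
# a sudden spike here usually precedes the NaNs in the logits.
total_norm = torch.norm(
    torch.stack([p.grad.norm(2) for p in model.parameters() if p.grad is not None])
)
print(f"grad norm: {total_norm.item():.4f}")
```

If the norm does blow up, clipping it (e.g. with `torch.nn.utils.clip_grad_norm_`) is the usual first mitigation.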

Hi @michaelzhiluo Thanks for the help!

Here is a snapshot of the policy loss. It seems something is going wrong… The loss values:
[image: policy loss plot]