I looked into it more, and it seems the NaNs are introduced via the gradients. The gradients become NaN on the backward call in ray.rllib.policy.torch_policy_v2._multi_gpu_parallel_grad_calc, specifically the loss_out[opt_idx].backward(retain_graph=True) call, which ultimately calls Variable._execution_engine.run_backward (which in turn calls out to the torch C extension). The NaN values are then propagated to the model parameters by the torch.optim.Adam.step method.
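To confirm where the non-finite values first show up, something like the following generic PyTorch check could be dropped in right after the backward call while debugging. This is just a sketch, not RLlib-specific: check_grads and model are illustrative names for whatever torch module the policy wraps.

```python
import torch

# Optional: make autograd raise on the op that first produces NaN/Inf
# during backward(). This slows training, so enable only while debugging.
torch.autograd.set_detect_anomaly(True)

def check_grads(model: torch.nn.Module) -> None:
    """Report any parameters whose gradients are non-finite after backward()."""
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            print(f"non-finite gradient in {name}")
```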
I think this means that I have runaway gradients. I am not entirely sure, but from everything I have seen it would make sense. I am going to look more into how to solve the runaway-gradient problem, maybe with gradient clipping, a smaller learning rate, etc., and see if I can keep the NaNs from coming back (see the config sketch below).
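A minimal sketch of what I plan to try, assuming PPO on a recent Ray 2.x release where the training config exposes lr and grad_clip (the environment and values are placeholders, not my actual setup):

```python
from ray.rllib.algorithms.ppo import PPOConfig  # assuming PPO; swap in your algorithm's config

config = (
    PPOConfig()
    .environment("CartPole-v1")  # placeholder environment
    .framework("torch")
    .training(
        lr=1e-4,         # smaller learning rate than before
        grad_clip=40.0,  # clip the global gradient norm before the optimizer step
    )
)
algo = config.build()
```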