NaN in train_batch[SampleBatch.ACTION_LOGP]

I’m training a PyTorch-based RL agent, and after some time the gradients become NaN.
I’ve narrowed it down to an element in train_batch[SampleBatch.ACTION_LOGP] becoming NaN, i.e.:

ipdb> train_batch[SampleBatch.ACTION_LOGP]
tensor([-2.2935, -2.1049, -6.6760, -2.9978, -3.4345, -2.3617, -7.0113, -2.2205,
        -2.7104, -2.4619, -2.6365, -4.2575, -1.9757, -2.1059, -3.3431, -4.4665,
        -2.4221, -2.0094, -2.5488, -3.4173, -3.6056, -3.2375, -2.1103, -1.9414,
        -5.7886, -2.8885, -2.7877, -1.9022, -5.3155, -4.7475, -1.7011, -2.7406,
        -3.8260, -5.1334, -3.9586, -1.7644, -3.9647, -3.0025, -2.3528, -2.3174,
        -2.4714, -2.8532, -2.3275, -2.1873, -2.3088, -2.7994, -8.2306, -3.3107,
        -2.0088, -3.3923, -2.5899, -2.0405, -1.8622, -2.3435, -1.9985, -2.9948,
        -1.4785, -3.2709, -4.4852, -2.0569, -6.1308, -3.3117, -1.9325, -2.4115,
        -2.0660, -1.9637, -4.5762, -2.5048, -1.8225, -1.6604, -2.2738, -1.9422,
        -3.6144, -2.9782, -1.8503, -2.2936, -2.9227, -2.6610, -4.4914, -2.4797,
        -4.8016, -3.0306, -2.6312, -3.3183, -1.8548, -2.0372, -1.9313, -1.9252,
        -3.2634, -2.9477, -3.6409, -1.7080, -2.6747, -2.1803, -2.4812, -2.2985,
        -2.1640, -2.9551, -2.7904, -2.7588, -2.4371, -2.8985, -2.0045,     nan,
        -5.1289, -2.1477, -3.5149, -1.9273, -2.8681, -2.8987, -4.2561, -2.6884,
        -2.7404, -2.1192, -6.0505, -3.0273, -3.1711, -3.4646, -2.6108, -3.6579,
        -4.1734, -2.0386, -3.0869, -3.0038, -2.3043, -1.8855, -1.9970, -2.1687],
       device='cuda:0')

which causes the NaN value to propagate from the ppo_surrogate_loss (here).

Any suggestions on how to proceed from here? It’s not clear to me where it’s being set (e.g. maybe here?). The observations and the model weights all look normal prior to this happening. Also, the gradients are neither vanishing nor exploding, so that’s not the issue.

Hey @vakker00, it looks like your weights explode at some point after a training update.

You could debug as follows:

In torch_policy.py, print out (or assert on) the loss values inside the learn_on_batch() method. You need to find the very first update step that takes the weights from good values to NaN.
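
Something along these lines, as a sketch (torch.isfinite does the check; the variable actually holding the losses may be named differently in your version):

import torch

# Sketch: call this on the loss tensor(s) right after they are computed inside
# learn_on_batch(); it fails at the very first NaN/Inf instead of much later.
def assert_finite(name, tensor):
    assert torch.isfinite(tensor).all(), f"{name} contains NaN/Inf: {tensor}"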

My guess is that the gradients become very large at some point, causing the update to mess up the weights. Are you using grad_clip=~10.0 or some similar value?
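
(For reference, I mean something like this in the Trainer config; the value is just an example.)

config = {
    "grad_clip": 10.0,  # clip gradients before each update step
    # ... rest of your PPO config ...
}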

Thanks @sven. Looking into it more, the loss does seem to diverge suddenly.
I’m using a very low LR (0.00001), and I’ve tried grad_clip of 10 and also 0.5, but it gives the same behaviour.

I’m wondering whether this is a hyperparameter issue, or whether there’s something more fundamental with the model that I’m using. Other models, e.g. a simple MLP, don’t diverge, but they also have significantly fewer parameters.

Printing out the loss, the sum of the model’s parameters, and the sum of the gradients, I don’t see any sudden jump:

...
(pid=76412) Loss [tensor(0.3096, device='cuda:0', grad_fn=<MeanBackward0>)]
(pid=76412) Total tensor(61252.8672, device='cuda:0', grad_fn=<AddBackward0>)
(pid=76412) Total grad tensor(1962.5626, device='cuda:0')
(pid=76412) Loss [tensor(0.1924, device='cuda:0', grad_fn=<MeanBackward0>)]
(pid=76412) Total tensor(61253.2031, device='cuda:0', grad_fn=<AddBackward0>)
(pid=76412) Total grad tensor(2287.5579, device='cuda:0')
(pid=76412) Loss [tensor(0.3172, device='cuda:0', grad_fn=<MeanBackward0>)]
(pid=76412) Total tensor(61253.5586, device='cuda:0', grad_fn=<AddBackward0>)
(pid=76412) Total grad tensor(2580.5706, device='cuda:0')
(pid=76412) Loss [tensor(0.2380, device='cuda:0', grad_fn=<MeanBackward0>)]
(pid=76412) Total tensor(61253.9102, device='cuda:0', grad_fn=<AddBackward0>)
(pid=76412) Total grad tensor(1895.9375, device='cuda:0')
(pid=76412) Loss [tensor(0.2722, device='cuda:0', grad_fn=<MeanBackward0>)]
(pid=76412) Total tensor(61254.2617, device='cuda:0', grad_fn=<AddBackward0>)
(pid=76412) Total grad tensor(1863.9541, device='cuda:0')
(pid=76412) Loss [tensor(0.1469, device='cuda:0', grad_fn=<MeanBackward0>)]
(pid=76412) Total tensor(61254.6055, device='cuda:0', grad_fn=<AddBackward0>)
(pid=76412) Total grad tensor(2085.6831, device='cuda:0')
(pid=76412) Loss [tensor(0.2103, device='cuda:0', grad_fn=<MeanBackward0>)]
(pid=76412) Total tensor(61254.9609, device='cuda:0', grad_fn=<AddBackward0>)
(pid=76412) Total grad tensor(1633.0918, device='cuda:0')
(pid=76412) Loss [tensor(0.3006, device='cuda:0', grad_fn=<MeanBackward0>)]
(pid=76412) Total tensor(61255.3242, device='cuda:0', grad_fn=<AddBackward0>)
(pid=76412) Total grad tensor(3317.4985, device='cuda:0')
(pid=76412) Loss [tensor(0.0787, device='cuda:0', grad_fn=<MeanBackward0>)]
(pid=76412) Total tensor(61255.6562, device='cuda:0', grad_fn=<AddBackward0>)
(pid=76412) Total grad tensor(1889.7074, device='cuda:0')
(pid=76412) Loss [tensor(0.3359, device='cuda:0', grad_fn=<MeanBackward0>)]
(pid=76412) Total tensor(61256.0039, device='cuda:0', grad_fn=<AddBackward0>)
(pid=76412) Total grad tensor(2098.6243, device='cuda:0')
(pid=76412) Loss [tensor(0.3528, device='cuda:0', grad_fn=<MeanBackward0>)]
(pid=76412) Total tensor(61256.3320, device='cuda:0', grad_fn=<AddBackward0>)
(pid=76412) Total grad tensor(2840.6421, device='cuda:0')
(pid=76412) Loss [tensor(0.0684, device='cuda:0', grad_fn=<MeanBackward0>)]
(pid=76412) Total tensor(61256.6562, device='cuda:0', grad_fn=<AddBackward0>)
(pid=76412) Total grad tensor(2535.8950, device='cuda:0')
(pid=76412) Loss [tensor(0.1024, device='cuda:0', grad_fn=<MeanBackward0>)]
(pid=76412) Total tensor(61256.9883, device='cuda:0', grad_fn=<AddBackward0>)
(pid=76412) Total grad tensor(3132.8201, device='cuda:0')
(pid=76412) Loss [tensor(0.2057, device='cuda:0', grad_fn=<MeanBackward0>)]
(pid=76412) Total tensor(61257.3281, device='cuda:0', grad_fn=<AddBackward0>)
(pid=76412) Total grad tensor(2327.9417, device='cuda:0')
(pid=76412) Loss [tensor(0.2255, device='cuda:0', grad_fn=<MeanBackward0>)]
(pid=76412) Total tensor(61257.6680, device='cuda:0', grad_fn=<AddBackward0>)
(pid=76412) Total grad tensor(2011.9187, device='cuda:0')
(pid=76412) Loss [tensor(nan, device='cuda:0', grad_fn=<MeanBackward0>)]

So I’m still not fully sure whether this is the usual exploding-gradients problem.
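
For reference, the numbers above are computed roughly like this (a sketch; model is the policy’s torch model and loss the minibatch loss):

import torch

def debug_summary(model: torch.nn.Module, loss: torch.Tensor) -> None:
    # Rough summaries printed above: total of all parameter values and the
    # total absolute gradient, to spot a sudden jump between updates.
    total = sum(p.sum() for p in model.parameters())
    total_grad = sum(p.grad.abs().sum() for p in model.parameters() if p.grad is not None)
    print("Loss", [loss])
    print("Total", total)
    print("Total grad", total_grad)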

I found the issue, which might be useful for others.

My policy network didn’t have a final activation layer, so its outputs could get very large or very small (even without exploding gradients).

The (Torch) action distribution is constructed as follows:

self.dist = torch.distributions.normal.Normal(mean, torch.exp(log_std))

So it is very easy (because of the exp) to overflow and get a distribution with an inf standard deviation, or to underflow and get one with a std of 0 (which results in ValueError: The parameter scale has invalid values).
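
A standalone snippet that reproduces both failure modes (plain PyTorch, outside RLlib):

import torch
from torch.distributions import Normal

mean = torch.zeros(2)

# Overflow: exp() of a large log_std is inf in float32, so the scale becomes
# inf. Sampling then returns +/-inf actions, whose log-prob is NaN.
dist = Normal(mean, torch.exp(torch.tensor([100.0, 100.0])))
action = dist.sample()
print(action, dist.log_prob(action))  # inf actions, NaN log-probs

# Underflow: exp() of a very negative log_std is exactly 0, which violates the
# scale > 0 constraint and raises the ValueError mentioned above.
try:
    Normal(mean, torch.exp(torch.tensor([-1000.0, -1000.0])))
except ValueError as err:
    print(err)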


Hi, I’m having a similar (maybe the same?) issue using continuous actions with the PPO TF policy, which uses the DiagGaussian distribution, so it sounds close to what you have described. For some reason I was not able to catch a single NaN in action_logp like you did. I also couldn’t quite understand how to solve the issue. This is what I understood:

PPO has an actor output (the mean and log_std values for the action distribution) and a separate critic output (the value estimate). The mean and log_std define a distribution from which we sample actions. So what I think you are saying is to add an activation function to the output of the actor, i.e. an activation layer over the existing mean and log_std values? For example, a tanh?
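
As far as I can tell, the distribution input looks roughly like this (sizes are illustrative, not the exact RLlib code):

import tensorflow as tf

action_dim = 3
# One actor output tensor of size 2 * action_dim, split into mean and log_std
# halves; exp(log_std) then becomes the standard deviation of the Gaussian.
actor_out = tf.random.normal([1, 2 * action_dim])
mean, log_std = tf.split(actor_out, 2, axis=1)
std = tf.exp(log_std)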

So what’s happening then? Does the loss suddenly become NaN?
For me it helped to step through the loss function to see exactly where the problem originates. It might be different for you, though.

But in general, yes: I added a tanh layer after the final linear layer, which bounds the output (and that output is what goes into the DiagGaussian).
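
For concreteness, the change amounts to something like this (layer sizes are illustrative, not my actual model):

import torch.nn as nn

action_dim = 4  # example
# Policy head: the final linear layer emits mean and log_std for each action
# dimension, and the Tanh bounds both to [-1, 1], so exp(log_std) stays in
# [1/e, e] instead of overflowing or underflowing.
policy_head = nn.Sequential(
    nn.Linear(256, 2 * action_dim),
    nn.Tanh(),
)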

To be honest, I’m not sure about the theoretical implications of this change, so you might get different convergence, stability, etc. Also, in my case it makes sense because the actions are between +/-1, but it might not be sensible in general.

Hello, since I’m unable to catch the exact step at which the NaN appears, I relied on the progress.csv file. It seems that total_loss becomes inf or NaN, and the KL and entropy do so at the same time. There are four terms that make up the total loss:

total_loss = MEAN(-surrogate_loss +
                  policy.kl_coeff * action_kl +
                  policy.config["vf_loss_coeff"] * vf_loss -
                  policy.entropy_coeff * curr_entropy)

I have copied the ppo_tf_policy.py file and am using that class as the policy class, so I can modify it for debugging as you suggested. I’m not sure how to single out the NaN/inf in question, though. Should I just assert not np.isnan() on all the variables related to these losses? The catch is that plain asserts only work if I run TensorFlow in eager mode, where execution happens without a computation graph. So should I assert on things like action_logps, advantages, and anything else used to compute those losses?
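
For example, something along these lines? (A sketch; I’m assuming tf.debugging.check_numerics can stand in for a plain assert, since it also runs as a graph op.)

import tensorflow as tf

# Wrap each suspect tensor with check_numerics so it raises as soon as a
# NaN/Inf shows up; `vf_loss` here is just a stand-in with a deliberate NaN.
vf_loss = tf.constant([0.3, float("nan")])
try:
    vf_loss = tf.debugging.check_numerics(vf_loss, message="vf_loss")
except tf.errors.InvalidArgumentError as err:
    print(err.message)  # names the term that went bad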

UPDATE

There seems to be something wrong with the surrogate_loss term: I removed each of the other terms individually to check whether I still got the NaN/inf after a few steps, and only removing the surrogate_loss term from the total_loss calculation made the NaN/inf error go away.