NaN in train_batch[SampleBatch.ACTION_LOGP]

I’m training a PyTorch-based RL agent, and after some time the gradients become NaN.
I’ve narrowed it down to an element in train_batch[SampleBatch.ACTION_LOGP] becoming NaN, i.e.:

ipdb> train_batch[SampleBatch.ACTION_LOGP]
tensor([-2.2935, -2.1049, -6.6760, -2.9978, -3.4345, -2.3617, -7.0113, -2.2205,
        -2.7104, -2.4619, -2.6365, -4.2575, -1.9757, -2.1059, -3.3431, -4.4665,
        -2.4221, -2.0094, -2.5488, -3.4173, -3.6056, -3.2375, -2.1103, -1.9414,
        -5.7886, -2.8885, -2.7877, -1.9022, -5.3155, -4.7475, -1.7011, -2.7406,
        -3.8260, -5.1334, -3.9586, -1.7644, -3.9647, -3.0025, -2.3528, -2.3174,
        -2.4714, -2.8532, -2.3275, -2.1873, -2.3088, -2.7994, -8.2306, -3.3107,
        -2.0088, -3.3923, -2.5899, -2.0405, -1.8622, -2.3435, -1.9985, -2.9948,
        -1.4785, -3.2709, -4.4852, -2.0569, -6.1308, -3.3117, -1.9325, -2.4115,
        -2.0660, -1.9637, -4.5762, -2.5048, -1.8225, -1.6604, -2.2738, -1.9422,
        -3.6144, -2.9782, -1.8503, -2.2936, -2.9227, -2.6610, -4.4914, -2.4797,
        -4.8016, -3.0306, -2.6312, -3.3183, -1.8548, -2.0372, -1.9313, -1.9252,
        -3.2634, -2.9477, -3.6409, -1.7080, -2.6747, -2.1803, -2.4812, -2.2985,
        -2.1640, -2.9551, -2.7904, -2.7588, -2.4371, -2.8985, -2.0045,     nan,
        -5.1289, -2.1477, -3.5149, -1.9273, -2.8681, -2.8987, -4.2561, -2.6884,
        -2.7404, -2.1192, -6.0505, -3.0273, -3.1711, -3.4646, -2.6108, -3.6579,
        -4.1734, -2.0386, -3.0869, -3.0038, -2.3043, -1.8855, -1.9970, -2.1687],
       device='cuda:0')

which causes the NaN value to propagate from the ppo_surrogate_loss (here).

Any suggestions on how to proceed from here? It’s not clear to me where it’s being set (e.g. maybe here?). The observations and the model weights all look normal prior to this happening. Also, the gradients are neither vanishing nor exploding, so that’s not the issue.

Hey @vakker00, it looks like your weights explode at some point after a training update.

You could debug as follows:

In torch_policy.py, print out (or assert on) the loss values inside the learn_on_batch() method. You need to find the very first update step that takes the weights from good values to NaN.
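
Something along these lines, as a sketch (torch.isfinite does the check; the variable actually holding the losses may be named differently in your version):

import torch

# Sketch: call this on the loss tensor(s) right after they are computed inside
# learn_on_batch(); it fails at the very first NaN/Inf instead of much later.
def assert_finite(name, tensor):
    assert torch.isfinite(tensor).all(), f"{name} contains NaN/Inf: {tensor}"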

My guess is that the gradients become very large at some point, causing the update to mess up the weights. Are you using grad_clip=~10.0 or some similar value?
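
(For reference, I mean something like this in the Trainer config; the value is just an example.)

config = {
    "grad_clip": 10.0,  # clip gradients before each update step
    # ... rest of your PPO config ...
}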

Thanks @sven. Looking into it more, the loss does seem to diverge suddenly.
I’m using a very low LR (0.00001), and I’ve tried grad_clip of 10 and also 0.5, but it gives the same behaviour.

I’m wondering whether this is a hyperparameter issue, or whether there’s something more fundamental with the model that I’m using. Other models, e.g. a simple MLP, don’t diverge, but they also have significantly fewer parameters.

Printing out the loss, the sum of the model’s parameters, and the sum of the gradients, I don’t see any sudden jump:

...
(pid=76412) Loss [tensor(0.3096, device='cuda:0', grad_fn=<MeanBackward0>)]
(pid=76412) Total tensor(61252.8672, device='cuda:0', grad_fn=<AddBackward0>)
(pid=76412) Total grad tensor(1962.5626, device='cuda:0')
(pid=76412) Loss [tensor(0.1924, device='cuda:0', grad_fn=<MeanBackward0>)]
(pid=76412) Total tensor(61253.2031, device='cuda:0', grad_fn=<AddBackward0>)
(pid=76412) Total grad tensor(2287.5579, device='cuda:0')
(pid=76412) Loss [tensor(0.3172, device='cuda:0', grad_fn=<MeanBackward0>)]
(pid=76412) Total tensor(61253.5586, device='cuda:0', grad_fn=<AddBackward0>)
(pid=76412) Total grad tensor(2580.5706, device='cuda:0')
(pid=76412) Loss [tensor(0.2380, device='cuda:0', grad_fn=<MeanBackward0>)]
(pid=76412) Total tensor(61253.9102, device='cuda:0', grad_fn=<AddBackward0>)
(pid=76412) Total grad tensor(1895.9375, device='cuda:0')
(pid=76412) Loss [tensor(0.2722, device='cuda:0', grad_fn=<MeanBackward0>)]
(pid=76412) Total tensor(61254.2617, device='cuda:0', grad_fn=<AddBackward0>)
(pid=76412) Total grad tensor(1863.9541, device='cuda:0')
(pid=76412) Loss [tensor(0.1469, device='cuda:0', grad_fn=<MeanBackward0>)]
(pid=76412) Total tensor(61254.6055, device='cuda:0', grad_fn=<AddBackward0>)
(pid=76412) Total grad tensor(2085.6831, device='cuda:0')
(pid=76412) Loss [tensor(0.2103, device='cuda:0', grad_fn=<MeanBackward0>)]
(pid=76412) Total tensor(61254.9609, device='cuda:0', grad_fn=<AddBackward0>)
(pid=76412) Total grad tensor(1633.0918, device='cuda:0')
(pid=76412) Loss [tensor(0.3006, device='cuda:0', grad_fn=<MeanBackward0>)]
(pid=76412) Total tensor(61255.3242, device='cuda:0', grad_fn=<AddBackward0>)
(pid=76412) Total grad tensor(3317.4985, device='cuda:0')
(pid=76412) Loss [tensor(0.0787, device='cuda:0', grad_fn=<MeanBackward0>)]
(pid=76412) Total tensor(61255.6562, device='cuda:0', grad_fn=<AddBackward0>)
(pid=76412) Total grad tensor(1889.7074, device='cuda:0')
(pid=76412) Loss [tensor(0.3359, device='cuda:0', grad_fn=<MeanBackward0>)]
(pid=76412) Total tensor(61256.0039, device='cuda:0', grad_fn=<AddBackward0>)
(pid=76412) Total grad tensor(2098.6243, device='cuda:0')
(pid=76412) Loss [tensor(0.3528, device='cuda:0', grad_fn=<MeanBackward0>)]
(pid=76412) Total tensor(61256.3320, device='cuda:0', grad_fn=<AddBackward0>)
(pid=76412) Total grad tensor(2840.6421, device='cuda:0')
(pid=76412) Loss [tensor(0.0684, device='cuda:0', grad_fn=<MeanBackward0>)]
(pid=76412) Total tensor(61256.6562, device='cuda:0', grad_fn=<AddBackward0>)
(pid=76412) Total grad tensor(2535.8950, device='cuda:0')
(pid=76412) Loss [tensor(0.1024, device='cuda:0', grad_fn=<MeanBackward0>)]
(pid=76412) Total tensor(61256.9883, device='cuda:0', grad_fn=<AddBackward0>)
(pid=76412) Total grad tensor(3132.8201, device='cuda:0')
(pid=76412) Loss [tensor(0.2057, device='cuda:0', grad_fn=<MeanBackward0>)]
(pid=76412) Total tensor(61257.3281, device='cuda:0', grad_fn=<AddBackward0>)
(pid=76412) Total grad tensor(2327.9417, device='cuda:0')
(pid=76412) Loss [tensor(0.2255, device='cuda:0', grad_fn=<MeanBackward0>)]
(pid=76412) Total tensor(61257.6680, device='cuda:0', grad_fn=<AddBackward0>)
(pid=76412) Total grad tensor(2011.9187, device='cuda:0')
(pid=76412) Loss [tensor(nan, device='cuda:0', grad_fn=<MeanBackward0>)]

So I’m still not fully sure whether this is the usual exploding-gradients problem.
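
For reference, the numbers above are computed roughly like this (a sketch; model is the policy’s torch model and loss the minibatch loss):

import torch

def debug_summary(model: torch.nn.Module, loss: torch.Tensor) -> None:
    # Rough summaries printed above: total of all parameter values and the
    # total absolute gradient, to spot a sudden jump between updates.
    total = sum(p.sum() for p in model.parameters())
    total_grad = sum(p.grad.abs().sum() for p in model.parameters() if p.grad is not None)
    print("Loss", [loss])
    print("Total", total)
    print("Total grad", total_grad)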

I found the issue, which might be useful for others.

My policy network didn’t have a final activation layer, so its outputs could get very large or very small (even without exploding gradients).

The (Torch) action distribution is constructed as follows:

self.dist = torch.distributions.normal.Normal(mean, torch.exp(log_std))

So it is very easy (because of the exp) to overflow and get a distribution with an inf standard deviation, or to underflow and get one with a std of 0 (which results in ValueError: The parameter scale has invalid values).
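
A standalone snippet that reproduces both failure modes (plain PyTorch, outside RLlib):

import torch
from torch.distributions import Normal

mean = torch.zeros(2)

# Overflow: exp() of a large log_std is inf in float32, so the scale becomes
# inf. Sampling then returns +/-inf actions, whose log-prob is NaN.
dist = Normal(mean, torch.exp(torch.tensor([100.0, 100.0])))
action = dist.sample()
print(action, dist.log_prob(action))  # inf actions, NaN log-probs

# Underflow: exp() of a very negative log_std is exactly 0, which violates the
# scale > 0 constraint and raises the ValueError mentioned above.
try:
    Normal(mean, torch.exp(torch.tensor([-1000.0, -1000.0])))
except ValueError as err:
    print(err)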


Hi, I’m having a similar (maybe the same?) issue using continuous actions with the PPO TF policy, which uses the DiagGaussian distribution, so it sounds close to what you have described. For some reason I was not able to catch a single NaN in action_logp like you did. I also couldn’t quite understand how to solve the issue. This is what I understood:

PPO has an actor output (the mean and log_std values for the action distribution) and a separate critic output (the value estimate). The mean and log_std define a distribution from which we sample actions. So what I think you are saying is to add an activation function to the output of the actor, i.e. an activation layer over the existing mean and log_std values? For example, a tanh?
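
As far as I can tell, the distribution input looks roughly like this (sizes are illustrative, not the exact RLlib code):

import tensorflow as tf

action_dim = 3
# One actor output tensor of size 2 * action_dim, split into mean and log_std
# halves; exp(log_std) then becomes the standard deviation of the Gaussian.
actor_out = tf.random.normal([1, 2 * action_dim])
mean, log_std = tf.split(actor_out, 2, axis=1)
std = tf.exp(log_std)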

So what’s happening then? Does the loss suddenly become NaN?
For me it helped to step through the loss function to see exactly where the problem originates. It might be different for you, though.

But in general, yes: I added a tanh layer after the final linear layer, which bounds the output (and that output is what goes into the DiagGaussian).
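
For concreteness, the change amounts to something like this (layer sizes are illustrative, not my actual model):

import torch.nn as nn

action_dim = 4  # example
# Policy head: the final linear layer emits mean and log_std for each action
# dimension, and the Tanh bounds both to [-1, 1], so exp(log_std) stays in
# [1/e, e] instead of overflowing or underflowing.
policy_head = nn.Sequential(
    nn.Linear(256, 2 * action_dim),
    nn.Tanh(),
)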

To be honest, I’m not sure about the theoretical implications of this change, so you might get different convergence, stability, etc. Also, in my case it makes sense because the actions are between +/-1, but it might not be sensible in general.

Hello, since I’m unable to catch the exact step at which the NaN appears, I relied on the progress.csv file. It seems that total_loss becomes inf or NaN, and the KL and entropy do so at the same time. There are four terms that make up the total loss:

total_loss = MEAN(-surrogate_loss +
                  policy.kl_coeff * action_kl +
                  policy.config["vf_loss_coeff"] * vf_loss -
                  policy.entropy_coeff * curr_entropy)

I have copied the ppo_tf_policy.py file and am using that class as the policy class, so I can modify it for debugging as you suggested. I’m not sure how to single out the NaN/inf in question, though. Should I just assert not np.isnan() on all the variables related to these losses? The catch is that plain asserts only work if I run TensorFlow in eager mode, where execution happens without a computation graph. So should I assert on things like action_logps, advantages, and anything else used to compute those losses?
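
For example, something along these lines? (A sketch; I’m assuming tf.debugging.check_numerics can stand in for a plain assert, since it also runs as a graph op.)

import tensorflow as tf

# Wrap each suspect tensor with check_numerics so it raises as soon as a
# NaN/Inf shows up; `vf_loss` here is just a stand-in with a deliberate NaN.
vf_loss = tf.constant([0.3, float("nan")])
try:
    vf_loss = tf.debugging.check_numerics(vf_loss, message="vf_loss")
except tf.errors.InvalidArgumentError as err:
    print(err.message)  # names the term that went bad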

UPDATE

There seems to be something wrong with the surrogate_loss term: I removed each of the other terms individually to check whether I still got the NaN/inf after a few steps, and only removing the surrogate_loss term from the total_loss calculation made the NaN/inf error go away.