ValueError: Expected parameter logits (...) to satisfy the constraint IndependentConstraint(Real(), 1)

Today I ran into a very strange problem with Ray RLlib (Ray version 2.0.0dev0). Code that I run all the time and that has always worked is suddenly throwing an error since today. This is the error I get:

File "/opt/conda/lib/python3.7/site-packages/ray/rllib/utils/threading.py", line 21, in wrapper
    return func(self, *a, **k)
  File "/opt/conda/lib/python3.7/site-packages/ray/rllib/policy/torch_policy.py", line 580, in compute_gradients
    [postprocessed_batch])
  File "/opt/conda/lib/python3.7/site-packages/ray/rllib/policy/torch_policy.py", line 1054, in _multi_gpu_parallel_grad_calc
    raise last_result[0] from last_result[1]
ValueError: Expected parameter logits (Tensor of shape (70, 6)) of distribution Categorical(logits: torch.Size([70, 6])) to satisfy the constraint IndependentConstraint(Real(), 1), but found invalid values:
tensor([[nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan],

 (... more nan lines here ...)

        [nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan]], grad_fn=<SubBackward0>)
In tower 0 on device cpu

My script only uses CPUs; I am not configuring GPUs in any way and I only want to use CPUs. I don't even understand what is happening here, so I would be glad for any help.

Hi @LukasNothhelfer ,

I cannot say anything without code, so I am only making a tiny guess: are you using an RNN?

Best

Hi @Lars_Simon_Zehnder
I am using a custom RNN model with RLlib's PPO algorithm. The error happens with both Ray 1.8 and Ray 2.0. I have a custom environment and a custom model. The exception is thrown in this line:

The error happens rarely on my local laptop but consistently on the Kubernetes cluster since yesterday. My remote setup is as follows: I create a Kubernetes container and start Ray inside that container, so Ray runs exclusively in that container and not across multiple containers. I've had to do it this way so far and it has always worked well. Unfortunately, this makes the whole thing very difficult to debug; I have been looking for the problem for two days now.
Fortunately, the problem occurred once on my local laptop and I was able to figure out that there seems to be a problem in this line:

I'm not entirely sure about that, though. I have debugged my environment a hundred times, as well as my model and the model's outputs. I am just wondering why NaNs are occurring here.

There is an issue with RNNs in the newest version of Ray. Maybe this is related.

Try rerunning with "simple_optimizer": True in your config.
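For PPO this is a top-level key of the config dict that goes to Tune; a minimal sketch (all other keys are just placeholders for your own settings) would be:

config = {
    "framework": "torch",       # placeholder for your existing settings
    "simple_optimizer": True,   # fall back to the simple (non-multi-GPU) optimizer
}
tune.run("PPO", config=config)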

Furthermore, try debugging through an entire trainer iteration, from sampling to training (in VSCode this works, for example, by setting a breakpoint and then stepping through from there). Somewhere these NaNs get produced.

Hi @Lars_Simon_Zehnder
I had a similar problem recently, and "simple_optimizer": True solved my issue back then (see here), but it is not working in this case. I have this problem with Ray 2.0 and Ray 1.8. Can you tell me in which version it does work?
I know how to debug scripts, but the error does not happen on my local computer, although it uses the same Ray version as my remote container setup. My training starts like this:

tune.run(
    "PPO",
    config=tune.grid_search(configs),
    verbose=1,
    num_samples=1, 
    checkpoint_freq=0, 
    checkpoint_at_end=False, 
    local_dir=args.output_folder,
...
    )

That means I don't have a custom training loop.

See also @mannyv's answer in another topic. You also need to set "sgd_minibatch_size" > "max_seq_len". As I cannot see your code, it's hard to make remote guesses.
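In a PPO config that constraint would look roughly like this (the values are only illustrative):

config = {
    "simple_optimizer": True,
    "sgd_minibatch_size": 64,   # must be larger than the model's max_seq_len
    "model": {
        "max_seq_len": 20,
    },
}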

I would, no matter whether custom or default, debug a whole trainer/sampler iteration to see what happens to the logits after evaluation/training. You say you use a custom RNN? Either it already outputs the NaNs, or they appear somewhere later in the trainer/sampler iteration. Somewhere these values must occur. When running on Kubernetes you might be able to execute the code in a single container and use "local_mode": True. You could also use a local Minikube and see if you can replicate the error there.
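Local mode keeps everything in a single process, which makes stepping through with a debugger much easier; a minimal sketch, assuming you initialize Ray yourself before calling Tune:

import ray

ray.init(local_mode=True)  # run everything serially in one process for debugging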

Hi @Lars_Simon_Zehnder
I am getting closer to the problem: I have located the bug in my own model's file. Somewhere in the constructor of the model I have this:

...

self.lin1 = nn.Linear(
    in_features=in_features,
    out_features=self.first_hidden_size,
    bias=True
)
...

And in my forward_rnn function I have something like this:

@override(RecurrentNetwork)
def forward_rnn(self, inputs, state, seq_lens):
    ...
    lin_out = self.lin1(inputs)
    ...

The problem is that self.lin1 produces the NaNs. I checked the input tensor inputs for NaNs, but there are none: inputs.isnan().any() gives tensor(False). inputs looks like this:

tensor([[[0.2659, 0.2401, 0.2857,  ..., 0.3055, 0.2429, 1.0000],
         [0.2647, 0.2359, 0.2884,  ..., 0.2962, 0.2426, 1.0000],
         [0.2637, 0.2359, 0.2879,  ..., 0.2858, 0.2411, 1.0016],
         ...,
         [0.2851, 0.2547, 0.3110,  ..., 0.2612, 0.2293, 0.8990],
         [0.2890, 0.2606, 0.3134,  ..., 0.2636, 0.2321, 0.8940],
         [0.2908, 0.2620, 0.3144,  ..., 0.2680, 0.2360, 0.8934]],
        [[0.2236, 0.1921, 0.2468,  ..., 0.2299, 0.2209, 1.0000],
         [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         ...,
         [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000]]])

with inputs.shape equal to torch.Size([2, 55, 23]).
The linear layer topology is Linear(in_features=23, out_features=16, bias=True). The shape of the output tensor is correct: lin_out.shape gives torch.Size([2, 55, 16]), but it is full of NaNs:

tensor([[[nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         ...,
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan]],

        [[nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         ...,
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan]]], grad_fn=<AddBackward0>)

Currently, I suspect the device configuration. I will check whether torch really computes everything on the CPU or whether it somehow tries to push the tensors to an unconfigured GPU. Thanks for any help so far.

UPDATE:
No, both tensors are on the CPU. This is configured correctly.

I have investigated the problem further. It looks like the weights in the linear layer are all NaN, so the problem lies somewhere in the update step of the layer's weights.
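A quick way to confirm this is a small check like the following (hypothetical snippet, placed anywhere the layer is accessible):

for name, param in self.lin1.named_parameters():
    print(name, param.isnan().any())  # becomes True once the weights are corrupted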

@LukasNothhelfer,

Your next step should probably be to start tensorboard and watch the learning stats.

Is it coming from the vf_loss, policy_loss, Entropy, kl…?

Hi @mannyv
I already checked all of these statistics. None of them has an infinite or NaN value during training. This is very confusing and pretty hard to debug, since everything runs remotely in a Kubernetes container and I can't reproduce the error on my local laptop, even though the runtime context is exactly the same.

I wish there was an easy way to track the gradients of the network.

I also tried clip_grad = 100.0 and clip_grad = 10.0, but I get the same error over and over again.

@LukasNothhelfer ,

is this happening from the first iteration on? Have you checked how the parameters in your linear layer are initialized? Are the inputs initialized?

Furthermore, try decreasing the learning rate further to avoid exploding gradients, which can produce NaN values. Another thing to check is whether you implemented your own loss function and used a sqrt() somewhere (this happens in MSE-style losses, for example).
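As a small illustration of why sqrt() is dangerous (a standalone toy example, not taken from your code): its derivative blows up at zero, and the forward pass already yields NaN for negative inputs.

import torch

x = torch.zeros(1, requires_grad=True)
y = torch.sqrt(x)   # forward pass is fine: tensor([0.])
y.backward()        # d/dx sqrt(x) = 1 / (2 * sqrt(x)) -> inf at x = 0
print(x.grad)       # tensor([inf]); inf gradients quickly turn the weights into NaN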

@mannyv, do you know how to use a callback to add a custom gradient-norm metric to track the gradients in TensorBoard? I once tried that without success (it was, however, a Conv2D layer with a more complex gradient structure). This could help to see if the gradients explode.

It might be possible to add

for name, param in model.named_parameters():
    if param.grad is not None:  # parameters without a gradient are skipped
        print(name, torch.isfinite(param.grad).all())

somewhere in the code where the model is available.

@LukasNothhelfer, another guess: if you normalize your inputs somewhere, you might be calculating the statistics across the time dimension instead of across the feature dimension.

Hope you solve your issue

@Lars_Simon_Zehnder
I will work on a workaround to track the gradients. My idea is to implement the on_learn_on_batch callback, copy my model in there, and do the compute_gradients work on the copy without affecting the original model. I guess I will end up putting these metrics into global Ray objects and accessing those objects in my loggers.

@LukasNothhelfer ,

from what I see in the TorchPolicy, you should have access to the policy's model in the callback and also to the postprocessed batch. You can then calculate the gradients via the policy's compute_gradients() method, passing it the postprocessed batch. This should have no influence on training (apart from performance), as you do not apply the gradients yourself (that would mean calling the apply_gradients() method of the TorchPolicy class); correct me if I am wrong, @mannyv.

The result dictionary then allows you to add the gradient norms (they must be scalars) to the metrics that get written out by the logger, and you can visualize these metrics directly in TensorBoard.
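Roughly like this (an untested sketch; the metric names are made up):

def on_learn_on_batch(self, *, policy, train_batch, result, **kwargs):
    policy.compute_gradients(train_batch)  # fills param.grad, nothing is applied
    for name, param in policy.model.named_parameters():
        if param.grad is not None:
            # scalar metrics get picked up by the logger and show up in TensorBoard
            result[f"grad_norm/{name}"] = param.grad.norm().item()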

@LukasNothhelfer,

You would also want to look for very large spikes in those values.

Have you tried disabling the LSTM? Do you still get the NaNs if you comment out the LSTM call and just pass through the linear layers?

If you can post a copy of your custom model, I would take a look to see if I notice anything off.

You should also check your inputs and see whether you are generating any extremely large input values. That can cause overflows too.
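A hypothetical one-line check inside forward_rnn (or wherever the batch is available):

print("max |input| =", inputs.abs().max().item())  # very large values hint at a scaling problem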

@mannyv My inputs are normalized to the range (-1, 1) (using sklearn's MinMaxScaler). Here is my custom network. Don't worry about the Time2Vec stuff; it's not used in my scenario (use_time2vec=False):

from ray.rllib.utils.annotations import override
from ray.rllib.models.torch.recurrent_net import RecurrentNetwork
from ray.rllib.models.modelv2 import ModelV2
import torch.nn as nn
import torch

import os, sys
sys.path.append(os.path.dirname(os.path.dirname(__file__)))
from time2vec.time2vec import Time2Vec

class PPOModel(RecurrentNetwork, nn.Module):
    def __init__(self, 
                 obs_space, 
                 action_space, 
                 num_outputs,
                 model_config, 
                 name):
        nn.Module.__init__(self)
        super().__init__(obs_space, 
                         action_space, 
                         num_outputs, 
                         model_config, 
                         name)
        
        # Network architecture
        # Input Layer
        # First Hidden Layer
        # GELU
        # Dropout
        # Second Hidden Layer
        # GELU
        # Dropout
        # LSTM
        # Dropout
        #   Value (1)
        #   Policy(num_outputs) 

        # Read the model configuration
        _model = model_config["custom_model_config"]["model"]
        self.first_hidden_size = _model["first_hidden_size"]
        self.second_hidden_size = _model["second_hidden_size"]
        self.lstm_hidden_size = _model["lstm_hidden_size"]
        self.dropout_p = _model["dropout_p"]
        
        # Time2Vec configuration
        _time2vec = model_config["custom_model_config"]["time2vec"]
        self.use_time2vec = _time2vec["use_time2vec"]
        self.k = _time2vec["k"]
        
        # Define the network
        self.gelu = nn.GELU()
        self.time2vec = Time2Vec(self.k)
        in_features = obs_space.shape[0] if not self.use_time2vec else \
            obs_space.shape[0] + self.k
        self.lin1 = nn.Linear(#in_features=obs_space.shape[0],
            in_features=in_features,
            out_features=self.first_hidden_size,
            bias=True
        )
        self.lin2 = nn.Linear(
            in_features=self.first_hidden_size,
            out_features=self.second_hidden_size,
            bias=True
        )
        self.lstm = nn.LSTM(
            input_size=self.second_hidden_size,
            hidden_size=self.lstm_hidden_size,
            num_layers=1,
            bias=True,
            batch_first=True,
            dropout=0.0,
            bidirectional=False,
            proj_size=0
        )
        self.dropout = nn.Dropout(p=self.dropout_p)
        self.vf = nn.Linear(
            in_features=self.lstm_hidden_size,
            out_features=1,
            bias=True
        )
        self.pi = nn.Linear(
            in_features=self.lstm_hidden_size,
            out_features=num_outputs,
            bias=True
        )
        self.value = None

    @override(ModelV2)
    def get_initial_state(self):
        h0, c0 = torch.zeros(self.lstm_hidden_size), \
            torch.zeros(self.lstm_hidden_size)
        return [h0, c0]
     
    @override(RecurrentNetwork)
    def forward_rnn(self, inputs, state, seq_lens):
        assert not inputs.isnan().any(), "NaN in inputs in forward"
        h0 = torch.unsqueeze(state[0], 0)
        c0 = torch.unsqueeze(state[1], 0)
        
        # Time2Vec
        if self.use_time2vec:
            
            # The timestamp is always stored in the last feature
            times = inputs[:,:,-1].unsqueeze(2)
            t2v = self.time2vec(times)
            inp = torch.cat((inputs[:,:,0:-1], t2v), dim=2)

        # If Time2Vec is not used, the inputs here do not contain any
        # Time2Vec timestamps either (this is different for the trend
        # prediction)
        else:
            inp = inputs
        
        gelu1 = self.dropout(self.gelu(self.lin1(inp)))
        gelu2 = self.dropout(self.gelu(self.lin2(gelu1)))

        output, (h_n, c_n) = self.lstm(gelu2, (h0, c0))
        output = self.dropout(output) # Dropout

        logits = self.pi(output)
        self.value = self.vf(output).flatten()

        return logits, [torch.squeeze(h_n, 0), torch.squeeze(c_n, 0)]
    
    @override(ModelV2)    
    def value_function(self):
        return self.value

@LukasNothhelfer

That looks fine. I do not see anything obvious. I was concerned at first about reusing the same dropout layer in multiple places, but it turns out that should be fine, since it uses the functional API inside.
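A quick standalone check of that reasoning: nn.Dropout holds no learnable parameters, so sharing one instance across layers cannot corrupt anything.

import torch.nn as nn

dropout = nn.Dropout(p=0.5)
print(list(dropout.parameters()))  # [] - nothing learnable, so reuse is harmless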

@mannyv
I didn't even think about that, since dropout does not have any learnable weights. But I'll keep that in mind. I'll dig further into my problem now.

@mannyv @Lars_Simon_Zehnder
I overrode the on_learn_on_batch method:

def on_learn_on_batch(self, *,
                      policy,
                      train_batch,
                      result,
                      **kwargs) -> None:

    grads, fetches = policy.compute_gradients(train_batch)
    for name, param in policy.model.named_parameters():
        if "time2vec" in name:  # skip time2vec, we don't use it here
            continue
        _isfinite = torch.isfinite(param.grad).all().item()
        if not _isfinite:
            print(f"grad for {name} not finite\n{param.grad}")
            print(f"{name}/grad/max={param.grad.max()}")
            print(f"{name}/grad/min={param.grad.min()}")
            print(f"{name}/grad/isfinite={_isfinite}")
            raise Exception(f"grad for {name} is not finite")

And this gives me this information:

(PPO pid=94) grad for lin1.weight not finite
(PPO pid=94) tensor([[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
(PPO pid=94)         [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
(PPO pid=94)         [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
(PPO pid=94)         [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
(PPO pid=94)         [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
(PPO pid=94)         [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
(PPO pid=94)         [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
(PPO pid=94)         [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
(PPO pid=94)         [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
(PPO pid=94)         [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
(PPO pid=94)         [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
(PPO pid=94)         [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
(PPO pid=94)         [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
(PPO pid=94)         [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
(PPO pid=94)         [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
(PPO pid=94)         [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]])
(PPO pid=94) lin1.weight/grad/max=nan
(PPO pid=94) lin1.weight/grad/min=nan
(PPO pid=94) lin1.weight/grad/isfinite=False

The question is: what the heck is happening here? Looks like I have to dig into the backpropagation process now to see what is happening in there…
But first I'll check whether the weights are initialized.

UPDATE:
The weights get initialized correctly. It looks like the problem is in the backpropagation. I'll copy the code from compute_gradients and _multi_gpu_parallel_grad_calc into my on_learn_on_batch function to add some custom debug output.