Error: nan Tensors in PyTorch with Ray RLlib for MARL

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi, I am using Ray RLlib to train a multi-agent reinforcement learning model in Python. The environment is custom-made; the task is to train two groups of agents to fight against each other.

I am training my model in a curriculum-learning fashion, but I don’t use a callback to set a new task during training; I change the task manually (this is just my preference).

My agents are trained with a shared policy. The agents (opponents) from the other group are not trained; they are hardcoded. However, at my current task level (level 3), I am doing a kind of self-play: both the agents and the opponents start with a copy of the policy obtained in level 2, but the agents’ policy is updated during training, whereas the opponents’ policy is frozen and only used for inference. With this configuration, I sometimes get an error from PyTorch during training. I have listed the most important code snippets below, followed by the error, which appears after around 6000 epochs. I suspect it is because of the rewards, which might be 0 for several episodes.

I don’t know if this is the right way to set up self-play in Ray RLlib, but inside the environment I basically load the same setup as for training and restore the algorithm from level 2 (see the ENV code below). The BasicEnv() class is just a blank class so that the algorithm can be set up inside the environment.

If more detailed code is needed, let me know! I am using Python 3.10.6, PyTorch 1.13.1 with CUDA 11.7, and Ray 2.0.0.

CODE:

from ray.rllib.algorithms import ppo
from ray.rllib.policy.policy import PolicySpec

config = {
    "multiagent": {
        "policies": {
            "shared_policy": PolicySpec(
                config={
                    "model": {
                        "fcnet_hiddens": [HL1, HL2],
                        "fcnet_activation": "tanh",
                        "vf_share_layers": True,
                    }
                }
            )
        },
        # All trainable agents map to the one shared policy.
        "policy_mapping_fn": (
            lambda agent_id, episode, **kwargs: "shared_policy"
        ),
    },
    "train_batch_size": 2000,
    "rollout_fragment_length": 1000,
    "gamma": 0.99,
    "framework": "torch",
    "horizon": HORIZON,
    "lambda": 0.8,
    "clip_param": 0.2,
    "lr": 1e-4,
    "num_workers": 4,
    "num_gpus": 1,
}

algo = ppo.PPO(env=Dogfight, config=config)
algo.restore(path)  # restore the level-2 checkpoint
for i in range(10001):
    result = algo.train()
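For context, the level-2 checkpoint restored above would typically have been produced by calling algo.save() during the earlier level’s training. A minimal, hedged sketch (the save interval and checkpoint directory here are only illustrative, not my actual values):

# Hedged sketch: periodic checkpointing during an earlier curriculum level.
for i in range(10001):
    result = algo.train()
    if i % 100 == 0:
        checkpoint_path = algo.save("checkpoints/level_2")  # returns the checkpoint path
        print(f"iteration {i}: checkpoint written to {checkpoint_path}")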

ENV:

from ray.rllib.env.multi_agent_env import MultiAgentEnv


class BasicEnv(MultiAgentEnv):
    """Blank environment, only used so that the self-play algorithm can be built."""

    def __init__(self):
        super().__init__()

    def reset(self):
        pass

    def step(self, actions):
        pass


class Dogfight(MultiAgentEnv):
    def __init__(self, env_config):
        super().__init__()
        ...
        self.algo = self.setup_ss()

    def setup_ss(self):
        ss_config = {
            "multiagent": {
                "policies": {
                    "shared_policy": PolicySpec(
                        config={
                            "model": {
                                "fcnet_hiddens": [512, 512],
                                "fcnet_activation": "tanh",
                                "vf_share_layers": True,
                            }
                        }
                    )
                },
                "policy_mapping_fn": (
                    lambda agent_id, episode, **kwargs: "shared_policy"
                ),
            },
            "train_batch_size": 2000,
            "rollout_fragment_length": 1000,
            "gamma": 0.99,
            "framework": "torch",
            "horizon": self.horizon,
            "lambda": 0.8,
            "clip_param": 0.2,
            "lr": 1e-4,
            "num_workers": 1,
            "explore": False,  # opponents only do inference
        }
        algo = ppo.PPO(env=BasicEnv, config=ss_config)
        algo.restore(self.restore_path)  # level-2 checkpoint
        return algo

    def step(self, actions):
        ...
        # Observation from the opponents' perspective.
        opponent_state = self.opponent_state()
        opponent_actions = self.algo.compute_single_action(
            observation=opponent_state, policy_id="shared_policy"
        )
        ...
        # *apply actions for agents and opponents*

ERROR:

Traceback (most recent call last):
  File "/home/ardianselmonaj/Projects/marl-warsim/train.py", line 847, in <module>
    start_training(algo)
  File "/home/ardianselmonaj/Projects/marl-warsim/train.py", line 809, in start_training
    result = algo.train()
  File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 347, in train
    result = self.step()
  File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 661, in step
    results, train_iter_ctx = self._run_one_training_iteration()
  File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 2378, in _run_one_training_iteration
    num_recreated += self.try_recover_from_step_attempt(
  File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 2190, in try_recover_from_step_attempt
    raise error
  File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 2373, in _run_one_training_iteration
    results = self.training_step()
  File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/algorithms/ppo/ppo.py", line 418, in training_step
    train_results = train_one_step(self, train_batch)
  File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/execution/train_ops.py", line 68, in train_one_step
    info = do_minibatch_sgd(
  File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/utils/sgd.py", line 129, in do_minibatch_sgd
    local_worker.learn_on_batch(
  File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/evaluation/rollout_worker.py", line 914, in learn_on_batch
    info_out[pid] = policy.learn_on_batch(batch)
  File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/utils/threading.py", line 24, in wrapper
    return func(self, *a, **k)
  File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/policy/torch_policy_v2.py", line 606, in learn_on_batch
    grads, fetches = self.compute_gradients(postprocessed_batch)
  File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/utils/threading.py", line 24, in wrapper
    return func(self, *a, **k)
  File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/policy/torch_policy_v2.py", line 789, in compute_gradients
    tower_outputs = self._multi_gpu_parallel_grad_calc([postprocessed_batch])
  File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/policy/torch_policy_v2.py", line 1179, in _multi_gpu_parallel_grad_calc
    raise last_result[0] from last_result[1]
ValueError: Expected parameter logits (Tensor of shape (128, 13)) of distribution Categorical(logits: torch.Size([128, 13])) to satisfy the constraint IndependentConstraint(Real(), 1), but found invalid values:
tensor([[nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        ...,
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan]], device='cuda:0',
       grad_fn=<SubBackward0>)
Traceback (most recent call last):
  File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/policy/torch_policy_v2.py", line 1095, in _worker
    self.loss(model, self.dist_class, sample_batch)
  File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/algorithms/ppo/ppo_torch_policy.py", line 87, in loss
    curr_action_dist = dist_class(logits, model)
  File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/models/torch/torch_action_dist.py", line 103, in __init__
    self.cats = [
  File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/models/torch/torch_action_dist.py", line 104, in <listcomp>
    torch.distributions.categorical.Categorical(logits=input_)
  File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/torch/distributions/categorical.py", line 66, in __init__
    super(Categorical, self).__init__(batch_shape, validate_args=validate_args)
  File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/torch/distributions/distribution.py", line 56, in __init__
    raise ValueError(
ValueError: Expected parameter logits (Tensor of shape (128, 13)) of distribution Categorical(logits: torch.Size([128, 13])) to satisfy the constraint IndependentConstraint(Real(), 1), but found invalid values:
tensor([[nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        ...,
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan]], device='cuda:0',
       grad_fn=<SubBackward0>)

In tower 0 on device cuda:0

Hi @ardian-selmonaj, and welcome to the discussion board! The error you are facing here is usually due to either NaN values in some observations or very high loss values in your policy.

For the former, you should check whether your environment could somehow produce NaN values in its observations. And for the latter: do you see large, unusual spikes in your policy_loss in TensorBoard?
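A minimal sketch of such a check, which you could call right before returning observations from reset() and step() (this assumes the observations are NumPy arrays keyed by agent id; the helper name is just an example):

import numpy as np

def assert_finite_obs(obs_dict):
    # Raise early if any observation contains NaN/inf, so the bad transition
    # is caught in the environment instead of later in the PPO loss.
    for agent_id, obs in obs_dict.items():
        if not np.all(np.isfinite(obs)):
            raise ValueError(f"Non-finite observation for agent {agent_id}: {obs}")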

As a side note: there might be a simpler way to define your environment than calling compute_single_action() inside its step() function. The actions should usually just be passed into step() by RLlib. In RLlib you can also define which policies are trained and which are not (see the configuration parameter policies_to_train).
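A rough sketch of that two-policy idea (this is not your current setup; the agent-id prefix, old_level2_config, and level2_checkpoint below are placeholders you would have to adapt):

from ray.rllib.algorithms import ppo
from ray.rllib.policy.policy import PolicySpec

model_conf = {"model": {"fcnet_hiddens": [512, 512],
                        "fcnet_activation": "tanh",
                        "vf_share_layers": True}}

two_policy_config = {
    "multiagent": {
        "policies": {
            "shared_policy": PolicySpec(config=model_conf),    # trained agents
            "opponent_policy": PolicySpec(config=model_conf),  # frozen opponents
        },
        "policy_mapping_fn": (
            # Placeholder mapping: adapt the prefix check to your agent ids.
            lambda agent_id, episode, **kwargs:
                "shared_policy" if str(agent_id).startswith("agent") else "opponent_policy"
        ),
        # Only the agents' policy receives gradient updates.
        "policies_to_train": ["shared_policy"],
    },
    "framework": "torch",
}

algo = ppo.PPO(env=Dogfight, config=two_policy_config)

# Pull the level-2 weights out of a separately restored single-policy algorithm
# (placeholder names), copy them into both new policies, and sync the workers.
old_algo = ppo.PPO(env=Dogfight, config=old_level2_config)
old_algo.restore(level2_checkpoint)
level2_weights = old_algo.get_policy("shared_policy").get_weights()

algo.get_policy("shared_policy").set_weights(level2_weights)
algo.get_policy("opponent_policy").set_weights(level2_weights)
algo.workers.sync_weights()

With a setup like this, the environment would no longer need to call compute_single_action() itself: the opponents are just additional (untrained) agents, and RLlib passes their actions into step() together with the agents' actions.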

Hi @Lars_Simon_Zehnder

thank you very much for your answer!

I inspected the policy_loss and value_loss and they seem to be okay; see the attached pictures below.
However, I still have to check whether NaN values can appear in my observations, though I don’t think so. I will let you know.

And thanks for the pointer about computing the opponents’ actions outside the environment, but I think that is not possible here, because at the beginning of level 3 all trainable agents and non-trainable opponents use the same policy. If I defined two policies in my config file and only trained the one for my agents, I could not restore the trained policy, because the configuration would no longer match. Or am I misunderstanding something?

@Lars_Simon_Zehnder the problem is neither in the observations nor in the loss function… do you know of any other possible cause? Or should I ask this question in a PyTorch forum?

Hi @ardian-selmonaj,

There are a few more metrics you can look at: total_loss, cur_kl_coeff, kl, and entropy.

You might also benefit from setting grad_clip to a value in the range [10, 40].
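For example, applied to the config dict from the original post (the value 20 is just one choice from that range):

# Hedged sketch: add global gradient-norm clipping to the existing PPO config.
config.update({
    "grad_clip": 20,  # clip gradients whose global norm exceeds 20
})
algo = ppo.PPO(env=Dogfight, config=config)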