Error: NaN tensors in PyTorch with Ray RLlib for MARL

How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi, I am using Ray RLlib to train a multi-agent reinforcement learning model in Python. The environment is custom-built. The task is to train agents to fight against each other in two groups.

I am training my model in a curriculum-learning fashion, but I don't use the callback function to set a new task during training; I change the task manually (this is just my preference).

My agents train with a shared policy. The agents (opponents) from the other group are not trained; they are hardcoded agents. However, in my current task level (level 3), I am doing a kind of self-play. Concretely, in level 3 both the agents and the opponents start with a copy of the policy obtained in level 2, but the agents' policy is updated during training, whereas the opponents' policy is frozen and only does inference. During training in this configuration, I sometimes get an error from PyTorch. I list the most important code snippets below, followed by the error, which appears after around 6000 epochs. I suspect it is because of the rewards, which might be 0 for several episodes.

I don't know if this is the right way to set up self-play in Ray RLlib, but inside the environment I basically load the whole setup as for training and restore the algorithm from level 2; see the ENV code below. The BasicEnv() class is just a blank class so that the algorithm can be set up inside the environment.

If more detailed code is needed, let me know! I am using Python 3.10.6, PyTorch 1.13.1 with CUDA 11.7, and Ray 2.0.0.

CODE:

# Imports needed for this snippet (Ray 2.0.0)
from ray.rllib.algorithms import ppo
from ray.rllib.policy.policy import PolicySpec

config = {
    "multiagent": {
        "policies": {
            "shared_policy": PolicySpec(
                config={
                    "model": {
                        "fcnet_hiddens" : [HL1, HL2],
                        "fcnet_activation": "tanh",
                        "vf_share_layers": True,
                    }
                }
            )
        },
        "policy_mapping_fn": (
            lambda agent_id, episode, **kwargs: "shared_policy"
        ),
    },
    "train_batch_size": 2000,
    "rollout_fragment_length": 1000,
    "gamma": 0.99,
    "framework": "torch",
    "horizon": HORIZON,
    "lambda": 0.8,
    "clip_param": 0.2,
    "lr": 1e-4,
    "num_workers": 4,
    "num_gpus": 1,
}
algo = ppo.PPO(env=Dogfight, config=config)
algo.restore(path)
for i in range(10001):
    result = algo.train()

ENV:

from ray.rllib.env.multi_agent_env import MultiAgentEnv


class BasicEnv(MultiAgentEnv):
    """Blank placeholder env, only used so the inner algorithm can be built."""

    def __init__(self):
        super().__init__()

    def reset(self):
        pass

    def step(self, actions):
        pass


class Dogfight(MultiAgentEnv):
    def __init__(self, env_config):
        super().__init__()
        ...
        # Inner (frozen) algorithm used to compute the opponents' actions.
        self.algo = self.setup_ss()

    def setup_ss(self):
        ss_config = {
            "multiagent": {
                "policies": {
                    "shared_policy": PolicySpec(
                        config={
                            "model": {
                                "fcnet_hiddens": [512, 512],
                                "fcnet_activation": "tanh",
                                "vf_share_layers": True,
                            }
                        }
                    )
                },
                "policy_mapping_fn": (
                    lambda agent_id, episode, **kwargs: "shared_policy"
                ),
            },
            "train_batch_size": 2000,
            "rollout_fragment_length": 1000,
            "gamma": 0.99,
            "framework": "torch",
            "horizon": self.horizon,
            "lambda": 0.8,
            "clip_param": 0.2,
            "lr": 1e-4,
            "num_workers": 1,
            "explore": False,
        }
        algo = ppo.PPO(env=BasicEnv, config=ss_config)
        algo.restore(self.restore_path)
        return algo

    def step(self, actions):
        ...
        # Returns the observation from the opponents' perspective.
        opponent_state = self.opponent_state()
        opponent_actions = self.algo.compute_single_action(
            observation=opponent_state, policy_id="shared_policy"
        )
        ...
        # *apply actions for agents and opponents*

ERROR:

Traceback (most recent call last):
  File "/home/ardianselmonaj/Projects/marl-warsim/train.py", line 847, in <module>
    start_training(algo)
  File "/home/ardianselmonaj/Projects/marl-warsim/train.py", line 809, in start_training
    result = algo.train()
  File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 347, in train
    result = self.step()
  File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 661, in step
    results, train_iter_ctx = self._run_one_training_iteration()
  File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 2378, in _run_one_training_iteration
    num_recreated += self.try_recover_from_step_attempt(
  File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 2190, in try_recover_from_step_attempt
    raise error
  File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 2373, in _run_one_training_iteration
    results = self.training_step()
  File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/algorithms/ppo/ppo.py", line 418, in training_step
    train_results = train_one_step(self, train_batch)
  File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/execution/train_ops.py", line 68, in train_one_step
    info = do_minibatch_sgd(
  File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/utils/sgd.py", line 129, in do_minibatch_sgd
    local_worker.learn_on_batch(
  File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/evaluation/rollout_worker.py", line 914, in learn_on_batch
    info_out[pid] = policy.learn_on_batch(batch)
  File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/utils/threading.py", line 24, in wrapper
    return func(self, *a, **k)
  File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/policy/torch_policy_v2.py", line 606, in learn_on_batch
    grads, fetches = self.compute_gradients(postprocessed_batch)
  File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/utils/threading.py", line 24, in wrapper
    return func(self, *a, **k)
  File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/policy/torch_policy_v2.py", line 789, in compute_gradients
    tower_outputs = self._multi_gpu_parallel_grad_calc([postprocessed_batch])
  File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/policy/torch_policy_v2.py", line 1179, in _multi_gpu_parallel_grad_calc
    raise last_result[0] from last_result[1]
ValueError: Expected parameter logits (Tensor of shape (128, 13)) of distribution Categorical(logits: torch.Size([128, 13])) to satisfy the constraint IndependentConstraint(Real(), 1), but found invalid values:
tensor([[nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        ...,
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan]], device='cuda:0',
       grad_fn=<SubBackward0>)
 tracebackTraceback (most recent call last):
  File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/policy/torch_policy_v2.py", line 1095, in _worker
    self.loss(model, self.dist_class, sample_batch)
  File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/algorithms/ppo/ppo_torch_policy.py", line 87, in loss
    curr_action_dist = dist_class(logits, model)
  File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/models/torch/torch_action_dist.py", line 103, in __init__
    self.cats = [
  File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/models/torch/torch_action_dist.py", line 104, in <listcomp>
    torch.distributions.categorical.Categorical(logits=input_)
  File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/torch/distributions/categorical.py", line 66, in __init__
    super(Categorical, self).__init__(batch_shape, validate_args=validate_args)
  File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/torch/distributions/distribution.py", line 56, in __init__
    raise ValueError(
ValueError: Expected parameter logits (Tensor of shape (128, 13)) of distribution Categorical(logits: torch.Size([128, 13])) to satisfy the constraint IndependentConstraint(Real(), 1), but found invalid values:
tensor([[nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        ...,
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan]], device='cuda:0',
       grad_fn=<SubBackward0>)

In tower 0 on device cuda:0

Hi @ardian-selmonaj, and welcome to the discussion board. The error you are facing here is most likely due to NaN values in some observations or to very high loss values in your policy.

For the former, check whether your environment could somehow produce NaN values in its observations. For the latter: do you see unusually large spikes in your policy_loss in TensorBoard?
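A quick way to rule out the environment is to assert on the observations (and rewards) right where they are produced, for example with a small helper like this sketch (the helper name is mine, and it assumes flat per-agent array observations):

import numpy as np

def assert_finite(obs_dict, rewards=None):
    """Raise immediately if any observation or reward contains NaN/Inf."""
    for agent_id, obs in obs_dict.items():
        arr = np.asarray(obs, dtype=np.float32)
        if not np.all(np.isfinite(arr)):
            raise ValueError(f"Non-finite observation for agent {agent_id}: {arr}")
    if rewards is not None:
        for agent_id, rew in rewards.items():
            if not np.isfinite(rew):
                raise ValueError(f"Non-finite reward for agent {agent_id}: {rew}")

# e.g. inside Dogfight.step(), just before returning:
# assert_finite(obs, rewards)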

As a side note: there might be a simpler way to define your environment than calling compute_single_action() inside its step() function; the actions should usually just be passed into step() by RLlib. In RLlib you can also define which policies should be trained and which should not (see the configuration parameter policies_to_train).
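Roughly, the idea looks like this (an untested sketch; the opponent policy ID and the agent-ID check in the mapping function are placeholders you would adapt to your setup):

config = {
    "multiagent": {
        "policies": {
            # Trained policy for your own agents (your model config could be passed here).
            "shared_policy": PolicySpec(),
            # Same kind of policy, but never updated: acts as the frozen opponent.
            "opponent_policy": PolicySpec(),
        },
        "policy_mapping_fn": (
            # Placeholder: map agent IDs to the right policy however your env names them.
            lambda agent_id, episode, **kwargs: (
                "shared_policy" if str(agent_id).startswith("agent") else "opponent_policy"
            )
        ),
        # Only this policy receives gradient updates; the opponent only does inference.
        "policies_to_train": ["shared_policy"],
    },
    # ... rest of the PPO config as before ...
}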

Hi @Lars_Simon_Zehnder

thank you very much for your answer!

I inspected the policy_loss and value_loss and they seem to be okay. See attached pictures below.
However, I still have to check whether NaN values can appear in my observations, though I don't think so. I will let you know.

And thanks for the suggestion to compute the opponents' actions outside the environment, but I don't think that is possible here, because at the beginning of level 3 all trainable agents and non-trainable opponents use the same policy. If I defined two policies in my config file and only trained the one for my agents, I could not restore the trained policy, because it would no longer be the same configuration. Or am I misunderstanding something?
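(I guess the alternative would then be to build the two-policy algorithm fresh and copy the restored level-2 weights into both policies by hand, roughly like the untested sketch below, where two_policy_config would be a config with policies_to_train as you described. But I am not sure this is the intended way.)

# Restore the level-2 checkpoint into the original single-policy setup ...
restored_algo = ppo.PPO(env=Dogfight, config=config)  # the old config with only "shared_policy"
restored_algo.restore(path)
level2_weights = restored_algo.get_policy("shared_policy").get_weights()

# ... then build the level-3 algorithm with two policies and copy the weights over.
selfplay_algo = ppo.PPO(env=Dogfight, config=two_policy_config)
selfplay_algo.get_policy("shared_policy").set_weights(level2_weights)
selfplay_algo.get_policy("opponent_policy").set_weights(level2_weights)
# The remote rollout workers would probably also need a weight sync afterwards,
# e.g. selfplay_algo.workers.sync_weights()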

@Lars_Simon_Zehnder the problem is neither in the observation space nor in the loss function. Do you know what else could cause the problem? Or should I ask this question in a PyTorch forum?

Hi @ardian-selmonaj,

There are a few more metrics you can look at: total_loss, cur_kl_coeff, kl, and entropy.

You might also benefit from setting grad_clip to a value between 10 and 40.
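With the dict-style config used earlier in this thread, that is just one extra key (the exact value is something to experiment with):

config = {
    # ... the rest of your PPO config from above ...
    "grad_clip": 40,  # clip gradients by global norm; try values in the 10-40 range
}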

Hi, what solution did you find to this problem? I’m currently facing the same issue.

My custom multi-agent environment trains fine with PPO, but not with APPO; after ~5 minutes of training I get a similar error. I checked the rewards and observations, and they are never NaN. I tried setting grad_clip to 2, which didn't help either.

(APPO pid=16752) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/David/ray_results/APPO_2024-03-17_17-23-13/APPO_flight_aafd8_00000_0_2024-03-17_17-23-20/checkpoint_000002)
(APPO pid=16752) Exception in thread Thread-1:
(APPO pid=16752) Traceback (most recent call last):
(APPO pid=16752)   File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\rllib\policy\torch_policy_v2.py", line 1369, in _worker
(APPO pid=16752)     self.loss(model, self.dist_class, sample_batch)
(APPO pid=16752)   File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\rllib\algorithms\appo\appo_torch_policy.py", line 134, in loss
(APPO pid=16752)     action_dist = dist_class(model_out, model)
(APPO pid=16752)   File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\rllib\models\torch\torch_action_dist.py", line 250, in __init__
(APPO pid=16752)     self.dist = torch.distributions.normal.Normal(mean, torch.exp(log_std))
(APPO pid=16752)   File "C:\Users\David\AppData\Roaming\Python\Python39\site-packages\torch\distributions\normal.py", line 56, in __init__
(APPO pid=16752)     super().__init__(batch_shape, validate_args=validate_args)
(APPO pid=16752)   File "C:\Users\David\AppData\Roaming\Python\Python39\site-packages\torch\distributions\distribution.py", line 68, in __init__
(APPO pid=16752)     raise ValueError(
(APPO pid=16752) ValueError: Expected parameter loc (Tensor of shape (550, 2)) of distribution Normal(loc: torch.Size([550, 2]), scale: torch.Size([550, 2])) to satisfy the constraint Real(), but found invalid values:
(APPO pid=16752) tensor([[nan, nan],
(APPO pid=16752)         [nan, nan],
(APPO pid=16752)         [nan, nan],
(APPO pid=16752)         ...,
(APPO pid=16752)         [nan, nan],
(APPO pid=16752)         [nan, nan],
(APPO pid=16752)         [nan, nan]], device='cuda:0', grad_fn=<SplitBackward0>)
(APPO pid=16752) 
(APPO pid=16752) The above exception was the direct cause of the following exception:
(APPO pid=16752) 
(APPO pid=16752) Traceback (most recent call last):
(APPO pid=16752)   File "c:\Users\David\.conda\envs\rllib\lib\threading.py", line 980, in _bootstrap_inner
(APPO pid=16752)     self.run()
(APPO pid=16752)   File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\rllib\execution\learner_thread.py", line 76, in run
(APPO pid=16752)     self.step()
(APPO pid=16752)   File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\rllib\execution\learner_thread.py", line 93, in step
(APPO pid=16752)     multi_agent_results = self.local_worker.learn_on_batch(batch)
(APPO pid=16752)   File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\rllib\evaluation\rollout_worker.py", line 815, in learn_on_batch
(APPO pid=16752)     info_out[pid] = policy.learn_on_batch(batch)
(APPO pid=16752)   File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\rllib\utils\threading.py", line 24, in wrapper
(APPO pid=16752)     return func(self, *a, **k)
(APPO pid=16752)   File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\rllib\policy\torch_policy_v2.py", line 730, in learn_on_batch
(APPO pid=16752)     grads, fetches = self.compute_gradients(postprocessed_batch)
(APPO pid=16752)   File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\rllib\utils\threading.py", line 24, in wrapper
(APPO pid=16752)     return func(self, *a, **k)
(APPO pid=16752)   File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\rllib\policy\torch_policy_v2.py", line 946, in compute_gradients
(APPO pid=16752)     tower_outputs = self._multi_gpu_parallel_grad_calc([postprocessed_batch])
(APPO pid=16752)   File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\rllib\policy\torch_policy_v2.py", line 1454, in _multi_gpu_parallel_grad_calc
(APPO pid=16752)     raise last_result[0] from last_result[1]
(APPO pid=16752) ValueError: Expected parameter loc (Tensor of shape (550, 2)) of distribution Normal(loc: torch.Size([550, 2]), scale: torch.Size([550, 2])) to satisfy the constraint Real(), but found invalid values:
(APPO pid=16752) tensor([[nan, nan],
(APPO pid=16752)         [nan, nan],
(APPO pid=16752)         [nan, nan],
(APPO pid=16752)         ...,
(APPO pid=16752)         [nan, nan],
(APPO pid=16752)         [nan, nan],
(APPO pid=16752)         [nan, nan]], device='cuda:0', grad_fn=<SplitBackward0>)
(APPO pid=16752)  tracebackTraceback (most recent call last):
(APPO pid=16752)   File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\rllib\policy\torch_policy_v2.py", line 1369, in _worker
(APPO pid=16752)     self.loss(model, self.dist_class, sample_batch)
(APPO pid=16752)   File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\rllib\algorithms\appo\appo_torch_policy.py", line 134, in loss
(APPO pid=16752)     action_dist = dist_class(model_out, model)
(APPO pid=16752)   File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\rllib\models\torch\torch_action_dist.py", line 250, in __init__
(APPO pid=16752)     self.dist = torch.distributions.normal.Normal(mean, torch.exp(log_std))
(APPO pid=16752)   File "C:\Users\David\AppData\Roaming\Python\Python39\site-packages\torch\distributions\normal.py", line 56, in __init__
(APPO pid=16752)     super().__init__(batch_shape, validate_args=validate_args)
(APPO pid=16752)   File "C:\Users\David\AppData\Roaming\Python\Python39\site-packages\torch\distributions\distribution.py", line 68, in __init__
(APPO pid=16752)     raise ValueError(
(APPO pid=16752) ValueError: Expected parameter loc (Tensor of shape (550, 2)) of distribution Normal(loc: torch.Size([550, 2]), scale: torch.Size([550, 2])) to satisfy the constraint Real(), but found invalid values:
(APPO pid=16752) tensor([[nan, nan],
(APPO pid=16752)         [nan, nan],
(APPO pid=16752)         [nan, nan],
(APPO pid=16752)         ...,
(APPO pid=16752)         [nan, nan],
(APPO pid=16752)         [nan, nan],
(APPO pid=16752)         [nan, nan]], device='cuda:0', grad_fn=<SplitBackward0>)
(APPO pid=16752) 
(APPO pid=16752) In tower 0 on device cuda:0
2024-03-17 17:31:36,097	ERROR tune_controller.py:1374 -- Trial task failed for trial APPO_flight_aafd8_00000
Traceback (most recent call last):
  File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\air\execution\_internal\event_manager.py", line 110, in resolve_future
    result = ray.get(future)
  File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\_private\auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\_private\client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\_private\worker.py", line 2624, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::APPO.train() (pid=16752, ip=127.0.0.1, actor_id=e95f1d1a63644982222642f601000000, repr=APPO)
  File "python\ray\_raylet.pyx", line 1813, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 1754, in ray._raylet.execute_task.function_executor
  File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\_private\function_manager.py", line 726, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\util\tracing\tracing_helper.py", line 467, in _resume_span
    return method(self, *_args, **_kwargs)
  File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\tune\trainable\trainable.py", line 342, in train
    raise skipped from exception_cause(skipped)
  File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\tune\trainable\trainable.py", line 339, in train
    result = self.step()
  File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\util\tracing\tracing_helper.py", line 467, in _resume_span
    return method(self, *_args, **_kwargs)
  File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\rllib\algorithms\algorithm.py", line 852, in step
    results, train_iter_ctx = self._run_one_training_iteration()
  File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\util\tracing\tracing_helper.py", line 467, in _resume_span
    return method(self, *_args, **_kwargs)
  File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\rllib\algorithms\algorithm.py", line 3042, in _run_one_training_iteration
    results = self.training_step()
  File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\util\tracing\tracing_helper.py", line 467, in _resume_span
    return method(self, *_args, **_kwargs)
  File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\rllib\algorithms\appo\appo.py", line 363, in training_step
    train_results = super().training_step()
  File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\rllib\algorithms\impala\impala.py", line 698, in training_step
    raise RuntimeError("The learner thread died while training!")
RuntimeError: The learner thread died while training!
2024-03-17 17:31:36,659	ERROR tune.py:1038 -- Trials did not complete: [APPO_flight_aafd8_00000]
2024-03-17 17:31:36,660	INFO tune.py:1042 -- Total run time: 496.57 seconds (495.95 seconds for the tuning loop).
(APPO pid=16752) Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::RolloutWorker.apply() (pid=27664, ip=127.0.0.1, actor_id=2d3e472c64004a0f0721650a01000000, repr=<ray.rllib.evaluation.rollout_worker.RolloutWorker object at 0x000001FEF99E3E50>)
(APPO pid=16752)   File "python\ray\_raylet.pyx", line 1813, in ray._raylet.execute_task
(APPO pid=16752)   File "python\ray\_raylet.pyx", line 1754, in ray._raylet.execute_task.function_executor
(APPO pid=16752)   File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\_private\function_manager.py", line 726, in actor_method_executor
(APPO pid=16752)     return method(__ray_actor, *args, **kwargs)
(APPO pid=16752)   File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\util\tracing\tracing_helper.py", line 467, in _resume_span
(APPO pid=16752)     return method(self, *_args, **_kwargs)
(APPO pid=16752)   File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\rllib\utils\actor_manager.py", line 189, in apply
(APPO pid=16752)     raise e
(APPO pid=16752)   File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\rllib\utils\actor_manager.py", line 178, in apply
(APPO pid=16752)     return func(self, *args, **kwargs)
(APPO pid=16752)   File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\rllib\algorithms\impala\impala.py", line 912, in <lambda>
(APPO pid=16752)     lambda worker: worker.sample(),
(APPO pid=16752)   File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\util\tracing\tracing_helper.py", line 467, in _resume_span
(APPO pid=16752)     return method(self, *_args, **_kwargs)
(APPO pid=16752)   File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\rllib\evaluation\rollout_worker.py", line 694, in sample
(APPO pid=16752)     batches = [self.input_reader.next()]
(APPO pid=16752)   File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\rllib\evaluation\sampler.py", line 91, in next
(APPO pid=16752)     batches = [self.get_data()]
(APPO pid=16752)   File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\rllib\evaluation\sampler.py", line 276, in get_data
(APPO pid=16752)     item = next(self._env_runner)
(APPO pid=16752)   File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\rllib\evaluation\env_runner_v2.py", line 344, in run
(APPO pid=16752)     outputs = self.step()
(APPO pid=16752)   File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\rllib\evaluation\env_runner_v2.py", line 382, in step
(APPO pid=16752)     eval_results = self._do_policy_eval(to_eval=to_eval)
(APPO pid=16752)   File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\rllib\evaluation\env_runner_v2.py", line 1082, in _do_policy_eval
(APPO pid=16752)     eval_results[policy_id] = policy.compute_actions_from_input_dict(
(APPO pid=16752)   File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\rllib\policy\torch_policy_v2.py", line 572, in compute_actions_from_input_dict
(APPO pid=16752)     return self._compute_action_helper(
(APPO pid=16752)   File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\rllib\utils\threading.py", line 24, in wrapper
(APPO pid=16752)     return func(self, *a, **k)
(APPO pid=16752)   File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\rllib\policy\torch_policy_v2.py", line 1305, in _compute_action_helper
(APPO pid=16752)     action_dist = dist_class(dist_inputs, self.model)
(APPO pid=16752)   File "c:\Users\David\.conda\envs\rllib\lib\site-packages\ray\rllib\models\torch\torch_action_dist.py", line 250, in __init__
(APPO pid=16752)     self.dist = torch.distributions.normal.Normal(mean, torch.exp(log_std))
(APPO pid=16752)   File "C:\Users\David\AppData\Roaming\Python\Python39\site-packages\torch\distributions\normal.py", line 56, in __init__
(APPO pid=16752)     super().__init__(batch_shape, validate_args=validate_args)
(APPO pid=16752)   File "C:\Users\David\AppData\Roaming\Python\Python39\site-packages\torch\distributions\distribution.py", line 68, in __init__
(APPO pid=16752)     raise ValueError(
(APPO pid=16752) ValueError: Expected parameter loc (Tensor of shape (1, 2)) of distribution Normal(loc: torch.Size([1, 2]), scale: torch.Size([1, 2])) to satisfy the constraint Real(), but found invalid values:
(APPO pid=16752) tensor([[nan, nan]])
(APPO pid=16752) [The same unhandled RolloutWorker.apply() error with the NaN "loc" traceback repeats for the remaining rollout workers (pids 26308, 27040, 18652).]
Clipped to adhere to max post length

I am also getting the same issue. Has anyone found a solution yet?

Hello, from my side I could never find the exact reason for this error. However, my intuition is that it can arise in sparse-reward environments, especially with long horizons. After modifying these two points, the error disappeared in my experiments.

I had this problem for weeks, and I was able to solve it by tinkering with the number of GPUs assigned. I stepped through the code and found that if I assigned too many GPUs or too many learners, a NaN would be computed in various places (for example, when computing the variance of a tensor with only one element).

Hence, when I decreased the number of learners, did not specify the num_gpus parameter in resources(), and set the batch sizes properly (so that each learner had enough samples), training worked as expected.
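Very roughly, the shape of the setup that ended up working for me looked like the following (a sketch only; MyMultiAgentEnv, the worker count, and the batch size are placeholders, and the exact option names differ between Ray versions, e.g. num_learner_workers in resources() versus num_learners in learners() on newer releases):

from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment(env=MyMultiAgentEnv)   # your custom multi-agent env (placeholder)
    .rollouts(num_rollout_workers=4)
    # Keep the learner setup small and do not force num_gpus; let RLlib use its defaults.
    .resources(num_learner_workers=1)
    # Batch size chosen for illustration; make it large enough per learner.
    .training(train_batch_size=4000)
)
algo = config.build()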

I have seen this particular error (about the action distribution loc being NaN) posted by many people, and I suspect each person may have a slightly different root cause. However, since this fix worked for me, it is probably a good idea to try out these changes.

I also had a custom environment with a long (infinite) horizon and no terminations.

This sounds amazing, and I found it very useful and informative. I have also gone through this post, which definitely helped me out a lot; as a new member I am looking forward to more such discussions.