Hi everyone,
I’m trying to train a PPO model with RLlib, but I keep running into the following error, which I assume is caused by the gradients “dying” (the policy output turns into NaN):
File "/opt/conda/lib/python3.9/site-packages/ray/rllib/algorithms/ppo/ppo_torch_policy.py", line 85, in loss
curr_action_dist = dist_class(logits, model)
File "/opt/conda/lib/python3.9/site-packages/ray/rllib/models/torch/torch_action_dist.py", line 512, in __init__
self.flat_child_distributions = tree.map_structure(
File "/opt/conda/lib/python3.9/site-packages/tree/__init__.py", line 435, in map_structure
[func(*args) for args in zip(*map(flatten, structures))])
File "/opt/conda/lib/python3.9/site-packages/tree/__init__.py", line 435, in <listcomp>
[func(*args) for args in zip(*map(flatten, structures))])
File "/opt/conda/lib/python3.9/site-packages/ray/rllib/models/torch/torch_action_dist.py", line 513, in <lambda>
lambda dist, input_: dist(input_, model),
File "/opt/conda/lib/python3.9/site-packages/ray/rllib/models/torch/torch_action_dist.py", line 250, in __init__
self.dist = torch.distributions.normal.Normal(mean, torch.exp(log_std))
File "/opt/conda/lib/python3.9/site-packages/torch/distributions/normal.py", line 56, in __init__
super().__init__(batch_shape, validate_args=validate_args)
File "/opt/conda/lib/python3.9/site-packages/torch/distributions/distribution.py", line 68, in __init__
raise ValueError(
ValueError: Expected parameter loc (Tensor of shape (128, 2)) of distribution Normal(loc: torch.Size([128, 2]), scale: torch.Size([128, 2])) to satisfy the constraint Real(), but found invalid values:
tensor([[nan, nan],
[nan, nan],
...
[nan, nan]], grad_fn=<SplitBackward0>)
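For context, this is roughly how the training is set up. It’s a simplified sketch, not my exact script: the environment name and hyperparameter values below are placeholders for my custom env, which has a 2-dimensional continuous action space.

```python
from ray.rllib.algorithms.ppo import PPOConfig

# Simplified sketch of my setup. "MyEnv-v0" is a placeholder for my custom
# environment (2-dimensional Box action space); the hyperparameter values
# are placeholders as well, not my exact config.
config = (
    PPOConfig()
    .environment("MyEnv-v0")
    .framework("torch")
    .training(
        lr=5e-5,                 # placeholder learning rate
        train_batch_size=4000,   # placeholder batch size
        sgd_minibatch_size=128,  # the NaN tensor in the error has this batch dimension
    )
)

algo = config.build()
for i in range(200):
    result = algo.train()
    print(i, result["episode_reward_mean"])
```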
Additionally, I’ve noticed that the reward stays very negative while the loss quickly drops to almost zero (around 1e-5), and the crash happens precisely when the policy loss reaches that point. Interestingly, switching the algorithm to APPO resolves the issue.
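For completeness, the APPO run is just the config class swapped out, using the same placeholder environment as in the sketch above; everything else stays the same.

```python
from ray.rllib.algorithms.appo import APPOConfig

# Same placeholder environment and settings as the PPO sketch; only the
# config class changes, and this run does not hit the NaN error.
config = (
    APPOConfig()
    .environment("MyEnv-v0")
    .framework("torch")
)

algo = config.build()
result = algo.train()
```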
Can anyone explain what’s happening here? Why does this error occur with PPO but not with APPO?
Thanks in advance for your help!