Missing 'grad_gnorm' key in some `input_trees` after some training time

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I’m running a self-built multi-agent env with PPO on a local Ray cluster. At the beginning everything works fine, but after around 200,000 env steps (1:20 hours in my case) into the training I’m running into the following error:

(run pid=2061) 2022-12-03 20:30:37,786	ERROR trial_runner.py:980 -- Trial PPO_StepRenderSplitHierarchicalEnv_5c5ff_00000: Error processing event.
(run pid=2061) ray.exceptions.RayTaskError(ValueError): ray::PPO.train() (pid=1314, ip=192.168.178.22, repr=PPO)
(run pid=2061)   File "/usr/local/lib/python3.10/dist-packages/tree/__init__.py", line 500, in _yield_flat_up_to
(run pid=2061)     for leaf_path, leaf_value in _yield_flat_up_to(shallow_subtree,
(run pid=2061)   File "/usr/local/lib/python3.10/dist-packages/tree/__init__.py", line 499, in _yield_flat_up_to
(run pid=2061)     input_subtree = input_tree[shallow_key]
(run pid=2061) KeyError: 'grad_gnorm'
(run pid=2061) 
(run pid=2061) The above exception was the direct cause of the following exception:
(run pid=2061) 
(run pid=2061) ray::PPO.train() (pid=1314, ip=192.168.178.22, repr=PPO)
(run pid=2061)   File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/trainable.py", line 347, in train
(run pid=2061)     result = self.step()
(run pid=2061)   File "/usr/local/lib/python3.10/dist-packages/ray/rllib/algorithms/algorithm.py", line 661, in step
(run pid=2061)     results, train_iter_ctx = self._run_one_training_iteration()
(run pid=2061)   File "/usr/local/lib/python3.10/dist-packages/ray/rllib/algorithms/algorithm.py", line 2378, in _run_one_training_iteration
(run pid=2061)     num_recreated += self.try_recover_from_step_attempt(
(run pid=2061)   File "/usr/local/lib/python3.10/dist-packages/ray/rllib/algorithms/algorithm.py", line 2190, in try_recover_from_step_attempt
(run pid=2061)     raise error
(run pid=2061)   File "/usr/local/lib/python3.10/dist-packages/ray/rllib/algorithms/algorithm.py", line 2373, in _run_one_training_iteration
(run pid=2061)     results = self.training_step()
(run pid=2061)   File "/usr/local/lib/python3.10/dist-packages/ray/rllib/algorithms/ppo/ppo.py", line 418, in training_step
(run pid=2061)     train_results = train_one_step(self, train_batch)
(run pid=2061)   File "/usr/local/lib/python3.10/dist-packages/ray/rllib/execution/train_ops.py", line 68, in train_one_step
(run pid=2061)     info = do_minibatch_sgd(
(run pid=2061)   File "/usr/local/lib/python3.10/dist-packages/ray/rllib/utils/sgd.py", line 135, in do_minibatch_sgd
(run pid=2061)     learner_info = learner_info_builder.finalize()
(run pid=2061)   File "/usr/local/lib/python3.10/dist-packages/ray/rllib/utils/metrics/learner_info.py", line 86, in finalize
(run pid=2061)     info[policy_id] = tree.map_structure_with_path(
(run pid=2061)   File "/usr/local/lib/python3.10/dist-packages/tree/__init__.py", line 469, in map_structure_with_path
(run pid=2061)     return map_structure_with_path_up_to(structures[0], func, *structures,
(run pid=2061)   File "/usr/local/lib/python3.10/dist-packages/tree/__init__.py", line 758, in map_structure_with_path_up_to
(run pid=2061)     for path_and_values in _multiyield_flat_up_to(shallow_structure, *structures):
(run pid=2061)   File "/usr/local/lib/python3.10/dist-packages/tree/__init__.py", line 516, in _multiyield_flat_up_to
(run pid=2061)     raise ValueError(f"Could not find key '{e.args[0]}' in some `input_trees`. "
(run pid=2061) ValueError: Could not find key 'grad_gnorm' in some `input_trees`. Please ensure the structure of all `input_trees` are compatible with `shallow_tree`. The last valid path yielded was ('learner_stats', 'entropy_coeff').

Does anyone have an idea what the problem could be here, or could point me in the right direction to fix it? Thanks in advance!

For anyone having the same problem: in my case the issue seemed to originate from missing reward returns for some agents that acted in my environment. After fixing this by returning 0 rewards for them, the training worked fine again.
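
For illustration, here is a minimal, hedged sketch of what that looks like in a MultiAgentEnv.step() on Ray 2.x (old gym-style returns). The env logic and the helper methods _compute_obs, _compute_reward and _episode_over are made up:

from ray.rllib.env.multi_agent_env import MultiAgentEnv


class MySelfBuiltEnv(MultiAgentEnv):
    """Toy sketch: every agent that acted also gets an (at least zero) reward."""

    def step(self, action_dict):
        obs, rewards, dones, infos = {}, {}, {}, {}
        for agent_id in action_dict:
            obs[agent_id] = self._compute_obs(agent_id)    # hypothetical helper
            reward = self._compute_reward(agent_id)        # hypothetical helper
            # Return an explicit 0.0 instead of omitting the key, so the
            # reward dict always covers the same agents as the obs dict.
            rewards[agent_id] = 0.0 if reward is None else reward
            dones[agent_id] = False
            infos[agent_id] = {}
        dones["__all__"] = self._episode_over()            # hypothetical helper
        return obs, rewards, dones, infos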

OK, the problem is still not fixed. To me it seems that the error comes from somewhere in the SGD part of PPO, so that grad_gnorm can’t be returned to the metrics dict. So far I have tried:

  • double-checking my obs, rewards, and dones, using the multi-agent examples from RLlib as guidelines.
  • experimenting with the vf_clip_param in PPO, because when I was running the training on a single machine I got a warning that it might be too low (see the snippet after this list).
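
For reference, setting that parameter looks roughly like this (a minimal sketch using the PPOConfig API; the value and the env name are just placeholders):

from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment(env="MySelfBuiltEnv")   # placeholder for the registered env
    .training(vf_clip_param=10.0)        # placeholder value to experiment with
)
algo = config.build()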

I’d be glad for any hints I can get on this.

I’m having the same issue, although I’m seeing it after fewer env steps (~8,000). My case is for a single agent, and my rewards are always >= 0. Following this thread.

Hi @Blubberblub, hi @dylan906 ,

I believe that this has been fixed now. The related commit should be in nightlies and the next release (2.3). Thanks for reporting this!

For reference, here’s the commit I believe does the fix in the grad norm function.

@arturn,

I was just looking at the fix. It makes me wonder whether that fix is masking some other kind of issue.

Why, in the middle of training, are there suddenly cases where some minibatches have no gradients? I can think of reasons where this might be completely expected, but I can also think of cases where it would actually be either a bug or a symptom of training issues.

@dylan906 are you able to provide any kind of reproduction script?

@mannyv I have seen cases of empty sample batches being passed around in RLlib. I can’t recall whether that happened in any regular scenario; if it did, that would definitely be an issue.
A repro script would be cool. Also, I can’t guarantee that this is the fix, since there is no repro script.

@arturn Thanks for the reply and for pointing out a potential issue. I switched to DQN to check whether the problem might be caused by faulty returns from my env, but everything works fine there. I will try to switch from 2.1.0 to master, and in case this breaks my whole project I will wait for 2.3.0 and check there.

@mannyv, I am an aerospace student and a tourist to ML, so unfortunately I can’t provide a reproduction script because my loosely-organized pile of files makes it difficult to reproduce the error. Or rather, if I did provide any script, it would cause more confusion and be anti-useful. Sorry!

Small update: I’m still testing. Currently I am struggling to install the wheels from master for 3.0.0.dev0 including all the extras (rllib, tune, etc.), since I’m using pipenv in my local development system and there seems to be an issue in pipenv that makes it impossible to install from a remote wheel with extras. So I will stick with the 2.2.0 release for now. @arturn How do you handle virtual environments etc. for testing in the Anyscale team?

@Blubberblub we install nightly wheels or build from source and link files with setup-dev.py for local development and testing. Tests otherwise run in our CI system (you can find that at the bottom of any PR page).

@arturn Thanks for the info on your testing process. I tried with master (3.0.0.dev0) and the problem seems to be solved. Thanks for looking into it!

Hi @Blubberblub, same problem here with grad_gnorm (version 2.2.0). How did you manage to install version 3.0.0.dev from wheels with all the extras? Thanks!

@Matteo_Pagliani Hey Matteo, welcome to the community! When you use pip you can use the instructions provided here: Installing Ray — Ray 3.0.0.dev0
Ray 3.0.0.dev0 is the current master. You have to choose the right wheel for your operating system and Python version.
Example:

# install ray with default, rllib, tune and air extras for linux and python 3.10
pip install -U "ray[default, rllib, tune, air] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp310-cp310-manylinux2014_x86_64.whl"

You can easily find the matching wheel name (the last part of the URL) in the nightlies section and substitute it into the URL above.

@Blubberblub Thanks a lot for your rapid answer! I managed to install ray 3.3.0 with all the extras. Unfortunately, I now get an error at the very start of the program that I wasn’t getting with version 2.2.0. The error is: TypeError: reset() got an unexpected keyword argument ‘seed’. I suspect the error is related in some way to the Gymnasium (or maybe gym) version that ray 3.3.0 has installed. Do you have an idea how I can resolve this? Thanks!

@Matteo_Pagliani Yes, you have to uninstall gym and install gymnasium.

@Matteo_Pagliani In case you are using your own environments, you also have to check whether they meet the new gymnasium API. I found that RLlib expects you to write your environments accordingly. This means adding some extra arguments to your reset() and step() functions, as well as adjusting your returns accordingly, since the done return was split into terminated and truncated. If you don’t write your own environments, you don’t have to worry about that.
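
To illustrate the API change, here is a minimal, hedged sketch of a custom environment written against the new gymnasium API (the spaces, episode length, and reward are placeholders):

import gymnasium as gym
import numpy as np


class MyGymnasiumEnv(gym.Env):
    """Toy env showing the gymnasium-style reset()/step() signatures."""

    def __init__(self, config=None):
        self.observation_space = gym.spaces.Box(-1.0, 1.0, shape=(4,), dtype=np.float32)
        self.action_space = gym.spaces.Discrete(2)
        self._steps = 0

    def reset(self, *, seed=None, options=None):
        # reset() now takes `seed` and `options` and returns (obs, info).
        super().reset(seed=seed)
        self._steps = 0
        return self.observation_space.sample(), {}

    def step(self, action):
        self._steps += 1
        obs = self.observation_space.sample()
        reward = 0.0
        terminated = False                  # the old `done` is split into
        truncated = self._steps >= 100      # `terminated` and `truncated`
        return obs, reward, terminated, truncated, {}
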
All the environment examples in the master version use the new API so you could check them here:

There is also a utility that gymnasium provides to convert old gym environments into gymnasium environments: Compatibility with Gym - Gymnasium Documentation
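
Based on that page, wrapping an old gym environment looks roughly like this (a hedged sketch; the compatibility entry point name and the env id are assumptions taken from the linked docs, not something verified against a specific environment):

import gymnasium

# Assumed usage from the "Compatibility with Gym" docs: wrap an environment
# that is still registered against the old gym API (the env id is a placeholder).
env = gymnasium.make("GymV26Environment-v0", env_id="OldGymEnv-v0")
obs, info = env.reset(seed=0)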

Hi @Blubberblub,

As in the previous discussion, I also installed the nightly version of Ray, and as a result I got the error TypeError: reset() got an unexpected keyword argument ‘seed’.

Since I am using a custom environment (utiasDSL/gym-pybullet-drones: PyBullet Gym environments for single and multi-agent reinforcement learning of quadcopter control), I think I need to check whether it conforms to the gymnasium API, as you mentioned. I understand that I need to add additional arguments to the step and reset functions, but this means that I need to make changes to the custom environment, correct? It cannot be handled universally through some kind of conversion utility or similar, correct?

I am debating whether to just wait for the release of Ray 2.3, in case the code changes needed to use the nightly version become complicated.