How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I'm running a self-built multi-agent env with PPO on a local Ray cluster. At the beginning everything works fine, but after around 200,000 env steps (1:20 hours in my case) into the training I run into the following error:
```
(run pid=2061) 2022-12-03 20:30:37,786 ERROR trial_runner.py:980 -- Trial PPO_StepRenderSplitHierarchicalEnv_5c5ff_00000: Error processing event.
(run pid=2061) ray.exceptions.RayTaskError(ValueError): ray::PPO.train() (pid=1314, ip=192.168.178.22, repr=PPO)
(run pid=2061) File "/usr/local/lib/python3.10/dist-packages/tree/__init__.py", line 500, in _yield_flat_up_to
(run pid=2061) for leaf_path, leaf_value in _yield_flat_up_to(shallow_subtree,
(run pid=2061) File "/usr/local/lib/python3.10/dist-packages/tree/__init__.py", line 499, in _yield_flat_up_to
(run pid=2061) input_subtree = input_tree[shallow_key]
(run pid=2061) KeyError: 'grad_gnorm'
(run pid=2061)
(run pid=2061) The above exception was the direct cause of the following exception:
(run pid=2061)
(run pid=2061) ray::PPO.train() (pid=1314, ip=192.168.178.22, repr=PPO)
(run pid=2061) File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/trainable.py", line 347, in train
(run pid=2061) result = self.step()
(run pid=2061) File "/usr/local/lib/python3.10/dist-packages/ray/rllib/algorithms/algorithm.py", line 661, in step
(run pid=2061) results, train_iter_ctx = self._run_one_training_iteration()
(run pid=2061) File "/usr/local/lib/python3.10/dist-packages/ray/rllib/algorithms/algorithm.py", line 2378, in _run_one_training_iteration
(run pid=2061) num_recreated += self.try_recover_from_step_attempt(
(run pid=2061) File "/usr/local/lib/python3.10/dist-packages/ray/rllib/algorithms/algorithm.py", line 2190, in try_recover_from_step_attempt
(run pid=2061) raise error
(run pid=2061) File "/usr/local/lib/python3.10/dist-packages/ray/rllib/algorithms/algorithm.py", line 2373, in _run_one_training_iteration
(run pid=2061) results = self.training_step()
(run pid=2061) File "/usr/local/lib/python3.10/dist-packages/ray/rllib/algorithms/ppo/ppo.py", line 418, in training_step
(run pid=2061) train_results = train_one_step(self, train_batch)
(run pid=2061) File "/usr/local/lib/python3.10/dist-packages/ray/rllib/execution/train_ops.py", line 68, in train_one_step
(run pid=2061) info = do_minibatch_sgd(
(run pid=2061) File "/usr/local/lib/python3.10/dist-packages/ray/rllib/utils/sgd.py", line 135, in do_minibatch_sgd
(run pid=2061) learner_info = learner_info_builder.finalize()
(run pid=2061) File "/usr/local/lib/python3.10/dist-packages/ray/rllib/utils/metrics/learner_info.py", line 86, in finalize
(run pid=2061) info[policy_id] = tree.map_structure_with_path(
(run pid=2061) File "/usr/local/lib/python3.10/dist-packages/tree/__init__.py", line 469, in map_structure_with_path
(run pid=2061) return map_structure_with_path_up_to(structures[0], func, *structures,
(run pid=2061) File "/usr/local/lib/python3.10/dist-packages/tree/__init__.py", line 758, in map_structure_with_path_up_to
(run pid=2061) for path_and_values in _multiyield_flat_up_to(shallow_structure, *structures):
(run pid=2061) File "/usr/local/lib/python3.10/dist-packages/tree/__init__.py", line 516, in _multiyield_flat_up_to
(run pid=2061) raise ValueError(f"Could not find key '{e.args[0]}' in some `input_trees`. "
(run pid=2061) ValueError: Could not find key 'grad_gnorm' in some `input_trees`. Please ensure the structure of all `input_trees` are compatible with `shallow_tree`. The last valid path yielded was ('learner_stats', 'entropy_coeff').
```
Does anyone have an idea what the problem could be here, or can someone point me in the right direction to fix it? Thanks in advance!