How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I am trying to train a PPO policy in a custom environment. After training for around 50 iterations, the following error is thrown:
ERROR trial_runner.py:1088 -- Trial experiment_HierarchicalGraphColorEnv_bc37e_00000: Error processing event.
ray.exceptions.RayTaskError(ValueError): ray::ImplicitFunc.train() (pid=8039, ip=172.10.3.120, repr=experiment)
File "/home/users/anaconda3/envs/conda-env/lib/python3.7/site-packages/ray/tune/trainable/trainable.py", line 367, in train
raise skipped from exception_cause(skipped)
File "/home/users/anaconda3/envs/conda-env/lib/python3.7/site-packages/ray/tune/trainable/function_trainable.py", line 338, in entrypoint
self._status_reporter.get_checkpoint(),
File "/home/users/anaconda3/envs/conda-env/lib/python3.7/site-packages/ray/tune/trainable/function_trainable.py", line 652, in _trainable_func
output = fn()
File "/home/venkatakeerthy.cs.iith/ML-Register-Allocation/model/RegAlloc/ggnn_drl/rllib_split_model/src/experiment_ppo.py", line 59, in experiment
train_results = train_agent.train()
File "/home/users/anaconda3/envs/conda-env/lib/python3.7/site-packages/ray/tune/trainable/trainable.py", line 367, in train
raise skipped from exception_cause(skipped)
File "/home/users/anaconda3/envs/conda-env/lib/python3.7/site-packages/ray/tune/trainable/trainable.py", line 364, in train
result = self.step()
File "/home/users/anaconda3/envs/conda-env/lib/python3.7/site-packages/ray/rllib/algorithms/algorithm.py", line 749, in step
results, train_iter_ctx = self._run_one_training_iteration()
File "/home/users/anaconda3/envs/conda-env/lib/python3.7/site-packages/ray/rllib/algorithms/algorithm.py", line 2623, in _run_one_training_iteration
results = self.training_step()
File "/home/venkatakeerthy.cs.iith/ML-Register-Allocation/model/RegAlloc/ggnn_drl/rllib_split_model/src/ppo_new.py", line 379, in training_step
train_results = train_one_step(self, train_batch)
File "/home/users/anaconda3/envs/conda-env/lib/python3.7/site-packages/ray/rllib/execution/train_ops.py", line 62, in train_one_step
[],
File "/home/users/anaconda3/envs/conda-env/lib/python3.7/site-packages/ray/rllib/utils/sgd.py", line 135, in do_minibatch_sgd
learner_info = learner_info_builder.finalize()
File "/home/users/anaconda3/envs/conda-env/lib/python3.7/site-packages/ray/rllib/utils/metrics/learner_info.py", line 87, in finalize
_all_tower_reduce, *results_all_towers
File "/home/users/anaconda3/envs/conda-env/lib/python3.7/site-packages/tree/__init__.py", line 550, in map_structure_with_path
**kwargs)
File "/home/users/anaconda3/envs/conda-env/lib/python3.7/site-packages/tree/__init__.py", line 841, in map_structure_with_path_up_to
shallow_structure, input_tree, check_types=check_types)
File "/home/users/anaconda3/envs/conda-env/lib/python3.7/site-packages/tree/__init__.py", line 684, in _assert_shallow_structure
shallow_branch, input_branch, check_types=check_types)
File "/home/users/anaconda3/envs/conda-env/lib/python3.7/site-packages/tree/__init__.py", line 664, in _assert_shallow_structure
shallow_length=_num_elements(shallow_tree)))
ValueError: The two structures don't have the same sequence length. Input structure has length 10, while shallow structure has length 11.
I recently upgraded Ray from 1.4 to 2.2.0. In the newer version (2.2.0), the code that aggregates the SGD/learner result info (ray/rllib/utils/metrics/learner_info.py) has changed, and the error occurs in that changed code.
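For context, the ValueError is raised by dm-tree while the per-tower learner stats are reduced in LearnerInfoBuilder.finalize(). Below is a minimal sketch (with made-up tower stats, not my actual values) that reproduces the same kind of structure-length mismatch:

```python
import tree  # dm-tree, the nest library RLlib uses to reduce per-tower stats


def mean_reduce(path, *values):
    # Average the same leaf across all towers (simplified stand-in for
    # RLlib's _all_tower_reduce).
    return sum(values) / len(values)


# Hypothetical stats from two towers; the first has one extra entry
# (11 leaves vs. 10), which is the kind of mismatch finalize() trips over.
tower_0 = {"learner_stats": list(range(11))}
tower_1 = {"learner_stats": list(range(10))}

# Raises: ValueError: The two structures don't have the same sequence length.
tree.map_structure_with_path(mean_reduce, tower_0, tower_1)
```

So it looks like one tower (or one minibatch result) is carrying an extra or missing stat entry compared to the others, though I haven't been able to pin down which one.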
Any help in understanding the issue better or fixing it would be really appreciated.
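In case it helps, my next step is to instrument LearnerInfoBuilder.finalize() to dump each tower's stats structure so I can see which key is extra or missing. A rough sketch of that (the results_all_towers attribute name is taken from the 2.2.0 source of learner_info.py; please correct me if that's not the right place to look):

```python
import tree
from ray.rllib.utils.metrics.learner_info import LearnerInfoBuilder

_orig_finalize = LearnerInfoBuilder.finalize


def debug_finalize(self):
    # Print the leaf paths of every tower's result dict before they are
    # reduced, so the structure mismatch can be spotted by eye.
    for policy_id, results_all_towers in self.results_all_towers.items():
        for i, tower_result in enumerate(results_all_towers):
            paths = [p for p, _ in tree.flatten_with_path(tower_result)]
            print(f"policy={policy_id} tower={i}: {len(paths)} leaves")
            for p in paths:
                print("   ", p)
    return _orig_finalize(self)


LearnerInfoBuilder.finalize = debug_finalize
```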