I have a very strange problem with Ray RLlib. Unfortunately the code is too large to reduce to a simple reproduction script, and I don't know what is causing the problem. I am using Ray 2.0.0. On my local laptop, training my PPO agent completes without any problems. However, when I run the code in a container on my university's Kubernetes cluster, I always get the error below. The code I run locally is identical, 1:1, to the code I run remotely, and I have made sure that it runs in the exact same context in both places. I have absolutely no idea what is happening. Can anyone give me a hint?
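Since I cannot post the full code, here is a minimal sketch of the shape of my training setup. This is an illustration only: the environment and config values are placeholders, not my actual ones; the import path matches the `ray.rllib.agents` path in the traceback.

```python
import ray
from ray.rllib.agents.ppo import PPOTrainer  # same ray.rllib.agents path as in the traceback

ray.init()

# Placeholder environment and values; my real env is custom and the config is larger.
trainer = PPOTrainer(
    env="CartPole-v1",
    config={
        "framework": "torch",  # the traceback goes through torch_policy.py
        "num_workers": 2,
        "num_gpus": 0,         # identical value locally and in the container
    },
)

for _ in range(10):
    result = trainer.train()
    print(result["episode_reward_mean"])
```

This is the full traceback I get on the cluster: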
```
ray::PPO.train_buffered() (pid=133, ip=10.1.11.26, repr=PPO)
  File "/opt/conda/lib/python3.8/site-packages/ray/tune/trainable.py", line 189, in train_buffered
    result = self.train()
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 681, in train
    raise e
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 667, in train
    result = Trainable.train(self)
  File "/opt/conda/lib/python3.8/site-packages/ray/tune/trainable.py", line 248, in train
    result = self.step()
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/agents/trainer_template.py", line 206, in step
    step_results = next(self.train_exec_impl)
  File "/opt/conda/lib/python3.8/site-packages/ray/util/iter.py", line 756, in __next__
    return next(self.built_iterator)
  File "/opt/conda/lib/python3.8/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/opt/conda/lib/python3.8/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/opt/conda/lib/python3.8/site-packages/ray/util/iter.py", line 843, in apply_filter
    for item in it:
  File "/opt/conda/lib/python3.8/site-packages/ray/util/iter.py", line 843, in apply_filter
    for item in it:
  File "/opt/conda/lib/python3.8/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/opt/conda/lib/python3.8/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/opt/conda/lib/python3.8/site-packages/ray/util/iter.py", line 791, in apply_foreach
    result = fn(item)
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/execution/train_ops.py", line 197, in __call__
    results = policy.learn_on_loaded_batch(
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/policy/torch_policy.py", line 607, in learn_on_loaded_batch
    return self.learn_on_batch(batch)
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/utils/threading.py", line 21, in wrapper
    return func(self, *a, **k)
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/policy/torch_policy.py", line 507, in learn_on_batch
    grads, fetches = self.compute_gradients(postprocessed_batch)
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/policy/policy_template.py", line 336, in compute_gradients
    return parent_cls.compute_gradients(self, batch)
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/utils/threading.py", line 21, in wrapper
    return func(self, *a, **k)
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/policy/torch_policy.py", line 678, in compute_gradients
    tower_outputs = self._multi_gpu_parallel_grad_calc(
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/policy/torch_policy.py", line 1052, in _multi_gpu_parallel_grad_calc
    raise last_result[0] from last_result[1]
ValueError: integer division or modulo by zero
In tower 0 on device cpu
```
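The failure is raised from `_multi_gpu_parallel_grad_calc`, yet the last line says the tower is on CPU, so I wondered whether the container simply exposes different resources than my laptop. To compare the two environments I can run the following sanity check in both places; these are standard Ray and PyTorch calls, nothing specific to my project:

```python
import ray
import torch

ray.init()
print("Ray resources:", ray.cluster_resources())    # CPUs/GPUs that Ray detects
print("CUDA available:", torch.cuda.is_available())
print("CUDA devices:", torch.cuda.device_count())   # 0 on a CPU-only node
```

If the outputs differ between the two machines, that would at least narrow down where the mismatch comes from.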