ValueError: integer division or modulo by zero, in tower 0

I have a very strange problem with Ray RLlib. Unfortunately, the code is too large for me to reduce it to a simple reproduction script, and I don't know what causes the problem. I am using Ray 2.0.0. On my local laptop, training my PPO agent runs through without any problems. However, when I run the same code in a container on my university's Kubernetes cluster, I always get the error message below. I have made sure that the code runs in exactly the same context on my local machine as on the cluster; the code I run locally is 1:1 the same as the code I run remotely. Can anyone give me a hint?

> ray::PPO.train_buffered() (pid=133, ip=10.1.11.26, repr=PPO)
>   File "/opt/conda/lib/python3.8/site-packages/ray/tune/trainable.py", line 189, in train_buffered
>     result = self.train()
>   File "/opt/conda/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 681, in train
>     raise e
>   File "/opt/conda/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 667, in train
>     result = Trainable.train(self)
>   File "/opt/conda/lib/python3.8/site-packages/ray/tune/trainable.py", line 248, in train
>     result = self.step()
>   File "/opt/conda/lib/python3.8/site-packages/ray/rllib/agents/trainer_template.py", line 206, in step
>     step_results = next(self.train_exec_impl)
>   File "/opt/conda/lib/python3.8/site-packages/ray/util/iter.py", line 756, in __next__
>     return next(self.built_iterator)
>   File "/opt/conda/lib/python3.8/site-packages/ray/util/iter.py", line 783, in apply_foreach
>     for item in it:
>   File "/opt/conda/lib/python3.8/site-packages/ray/util/iter.py", line 783, in apply_foreach
>     for item in it:
>   File "/opt/conda/lib/python3.8/site-packages/ray/util/iter.py", line 843, in apply_filter
>     for item in it:
>   File "/opt/conda/lib/python3.8/site-packages/ray/util/iter.py", line 843, in apply_filter
>     for item in it:
>   File "/opt/conda/lib/python3.8/site-packages/ray/util/iter.py", line 783, in apply_foreach
>     for item in it:
>   File "/opt/conda/lib/python3.8/site-packages/ray/util/iter.py", line 783, in apply_foreach
>     for item in it:
>   File "/opt/conda/lib/python3.8/site-packages/ray/util/iter.py", line 791, in apply_foreach
>     result = fn(item)
>   File "/opt/conda/lib/python3.8/site-packages/ray/rllib/execution/train_ops.py", line 197, in __call__
>     results = policy.learn_on_loaded_batch(
>   File "/opt/conda/lib/python3.8/site-packages/ray/rllib/policy/torch_policy.py", line 607, in learn_on_loaded_batch
>     return self.learn_on_batch(batch)
>   File "/opt/conda/lib/python3.8/site-packages/ray/rllib/utils/threading.py", line 21, in wrapper
>     return func(self, *a, **k)
>   File "/opt/conda/lib/python3.8/site-packages/ray/rllib/policy/torch_policy.py", line 507, in learn_on_batch
>     grads, fetches = self.compute_gradients(postprocessed_batch)
>   File "/opt/conda/lib/python3.8/site-packages/ray/rllib/policy/policy_template.py", line 336, in compute_gradients
>     return parent_cls.compute_gradients(self, batch)
>   File "/opt/conda/lib/python3.8/site-packages/ray/rllib/utils/threading.py", line 21, in wrapper
>     return func(self, *a, **k)
>   File "/opt/conda/lib/python3.8/site-packages/ray/rllib/policy/torch_policy.py", line 678, in compute_gradients
>     tower_outputs = self._multi_gpu_parallel_grad_calc(
>   File "/opt/conda/lib/python3.8/site-packages/ray/rllib/policy/torch_policy.py", line 1052, in _multi_gpu_parallel_grad_calc
>     raise last_result[0] from last_result[1]
> ValueError: integer division or modulo by zero
> In tower 0 on device cpu

Hi @LukasNothhelfer,

Are you intending to train with a CPU or a GPU, and if GPU, how many?

I am intending to train with a CPU:

		("framework", "torch"), 
                ("log_level", "WARN"),
		("num_gpus", 0),  
		("num_workers", 0),  
		("num_envs_per_worker", 1),  
		("num_cpus_per_worker", 1),  
		("num_gpus_per_worker", 0),
		("custom_resources_per_worker", {}),
		("evaluation_num_workers", 1), 
		("num_cpus_for_driver", 1),  
		("create_env_on_driver", True),  

@mannyv But wait a second. I didn't specify the resources per trial. The Kubernetes cluster has GPUs which I don't want to use. Let me check whether the problem is solved when I specify the resources_per_trial argument.

@LukasNothhelfer,

Sometimes, when I want to use only the CPU but the system has GPUs and torch is installed with GPU support enabled, I have to manually set

export CUDA_VISIBLE_DEVICES=""

before running.
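
If setting the shell variable inside the container is awkward, the same effect can be had at the very top of the training script, before torch/Ray are imported (a sketch):

    # Hide all GPUs from this process; this must run before torch/ray are
    # imported, otherwise CUDA may already have been initialized with the
    # devices visible.
    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = ""

    import ray  # imported only after the GPUs have been hidden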

You could also create an environment that has the CPU-only version of torch.

Whatever the equivalent of this is for your setup:

conda install pytorch torchvision torchaudio cpuonly -c pytorch

@mannyv I can't specify resources_per_trial because I use Ray RLlib's own PPO algorithm, and its resources are controlled via the config. Unfortunately, emptying the environment variable CUDA_VISIBLE_DEVICES doesn't help either.

@LukasNothhelfer

You could try setting config["simple_optimizer"] = True

That would avoid the multi-GPU branch of the code, but I think it will still try to use the GPU for some parts.
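
For example (assuming the same dict-based config shown earlier in the thread):

    # Force the simple, single-device optimizer so the
    # _multi_gpu_parallel_grad_calc path from the traceback is skipped.
    config["simple_optimizer"] = True
    trainer = PPOTrainer(config=config)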


@mannyv Thanks, that solved my problem. Since I train everything on the CPU(s) anyway, this workaround is fine for me. I am using Ray 2.0.0, which is currently still in development, so I suspect this falls into the bugs category.