Hello,
I have an error on Windows when using PPO with PyTorch:
...\envs\rllib-pt\lib\site-packages\ray\rllib\policy\torch_policy.py", line 155, in __init__
(pid=18860) self.device = self.devices[0]
(pid=18860) IndexError: list index out of range
I ensured that torch.cuda.is_available() returns True. The error comes from ray.get_gpu_ids, which returns an empty list even when num_gpus was set in ray.init. Interestingly, ray.resource_spec._autodetect_num_gpus() returns 1, so one could imagine a fix based on that, but maybe there is another solution. I am using ray==1.4.
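A minimal sketch to check the mismatch outside of RLlib (plain ray + torch only; the comments describe the behavior reported above):
import ray
import torch

# PyTorch sees the GPU.
print(torch.cuda.is_available())  # True on this machine

# Request one GPU for the Ray cluster.
ray.init(num_gpus=1)

# Reported behavior: this comes back as an empty list,
# even though num_gpus=1 was passed to ray.init.
print(ray.get_gpu_ids())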
Best,
Try downgrading to Ray==1.2, lmk if it still doesn’t work!
Hi @michaelzhiluo , thanks for the answer!
However it still does not work with ray==1.2.0.
I got the same error in Ubuntu. This code:
import ray
import ray.rllib.agents.ppo as ppo
import torch
print(torch.cuda.is_available())
ray.init()
config = ppo.DEFAULT_CONFIG.copy()  # copy the defaults instead of mutating them in place
config["num_gpus"] = 1
config["framework"] = "torch"
trainer = ppo.PPOTrainer(config=config, env="CartPole-v0")
returns “True” for CUDA and this at the end of the traceback:
File "/home/bukovskiy/.local/lib/python3.8/site-packages/ray/rllib/policy/torch_policy.py", line 159, in __init__
self.device = self.devices[0]
IndexError: list index out of range
referring to these lines in torch_policy.py:
self.devices = [
torch.device("cuda:{}".format(i))
for i, id_ in enumerate(gpu_ids) if i < config["num_gpus"]
]
self.device = self.devices[0]
Thanks for the test on Ubuntu and the reproduction code!
I am not sure why we need to call ray.get_gpu_ids, as the only info we need is the number of GPUs set in the config. What do you think @sven1977?
Any update on this? @michaelzhiluo
Try import tensorflow as tf and check whether tf.config.list_physical_devices('GPU') returns a list of GPUs. Make sure you have CUDA and cuDNN installed properly.
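For example (assuming TensorFlow 2.1+, where this function is available):
import tensorflow as tf

# Should print a non-empty list, e.g.
# [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
print(tf.config.list_physical_devices('GPU'))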
Hey @Fabien-Couthouis, thanks for posting this question. We check the actually visible GPUs via ray.get_gpu_ids to make sure we build the Policy on the correct device. Using config.num_gpus alone would not work here, as the rollout workers should not use any GPUs (even though their copy of the Policy still has config.num_gpus > 0).
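A quick way to see that difference outside of RLlib (a minimal sketch; the two task names here are just illustrative):
import ray

ray.init(num_gpus=1)

@ray.remote(num_gpus=0)
def rollout_worker_like():
    # Requests 0 GPUs, so Ray assigns none to this task,
    # regardless of what any config dict says.
    return ray.get_gpu_ids()

@ray.remote(num_gpus=1)
def learner_like():
    # Requests 1 GPU, so Ray assigns one.
    return ray.get_gpu_ids()

print(ray.get(rollout_worker_like.remote()))  # []
print(ray.get(learner_like.remote()))  # e.g. [0]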
This may actually be a Ray-on-Windows bug? @sangcho @Alex, any ideas here?