Error with torch policy and ray.get_gpu_ids on Windows

Hello,

I have an error on Windows when using PPO with Pytorch:

...\envs\rllib-pt\lib\site-packages\ray\rllib\policy\torch_policy.py", line 155, in __init__
(pid=18860)     self.device = self.devices[0]
(pid=18860) IndexError: list index out of range

I made sure that torch.cuda.is_available() returns True. The error comes from ray.get_gpu_ids() returning an empty list, even though num_gpus was set in ray.init(). Interestingly, ray.resource_spec._autodetect_num_gpus() returns 1, so one could imagine a fix based on that, but maybe there is a better solution. I am using ray==1.4.
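For reference, a minimal sketch of the check I describe above (purely illustrative; the exact behaviour may differ between Ray versions):

import ray
import torch

ray.init(num_gpus=1)

@ray.remote(num_gpus=1)
def show_gpu_ids():
    # Inside a task that requested a GPU, Ray should report the assigned GPU ids.
    return ray.get_gpu_ids()

print("torch.cuda.is_available():", torch.cuda.is_available())
print("ray.get_gpu_ids() inside a GPU task:", ray.get(show_gpu_ids.remote()))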

Best,


Try downgrading to Ray==1.2, lmk if it still doesn't work!

Hi @michaelzhiluo, thanks for the answer!
However, it still does not work with ray==1.2.0.

I got the same error on Ubuntu. This code:

import ray
import ray.rllib.agents.ppo as ppo
import torch

print(torch.cuda.is_available())
ray.init()
config = ppo.DEFAULT_CONFIG.copy()
config["num_gpus"] = 1
config["framework"] = "torch"
trainer = ppo.PPOTrainer(config=config, env="CartPole-v0")

returns “True” for CUDA and this at the end of the traceback:

 File "/home/bukovskiy/.local/lib/python3.8/site-packages/ray/rllib/policy/torch_policy.py", line 159, in __init__
    self.device = self.devices[0]
IndexError: list index out of range

referring to these lines in torch_policy.py:

self.devices = [
    torch.device("cuda:{}".format(i))
    for i, id_ in enumerate(gpu_ids)
    if i < config["num_gpus"]
]
self.device = self.devices[0]
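For what it's worth, here is a hedged sketch of a possible local workaround (not an official fix, and pick_devices is just a hypothetical helper name): fall back to torch.cuda.device_count() when ray.get_gpu_ids() comes back empty, and to CPU if nothing is visible at all:

import ray
import torch

def pick_devices(config):
    # Hypothetical helper: prefer Ray's view of the assigned GPUs, fall back
    # to whatever torch can see, and finally to CPU instead of crashing.
    gpu_ids = ray.get_gpu_ids() or list(range(torch.cuda.device_count()))
    devices = [
        torch.device("cuda:{}".format(i))
        for i, _ in enumerate(gpu_ids)
        if i < config["num_gpus"]
    ]
    return devices or [torch.device("cpu")]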

Thanks for the test on Ubuntu and the reproduction code!
I am not sure why we need to call ray.get_gpu_ids() here, as the only information we need is the number of GPUs set in the config. What do you think @sven1977?

Any update on this? @michaelzhiluo

Try importing tensorflow as tf and check whether tf.config.list_physical_devices('GPU') returns a list of GPUs. Make sure you have CUDA and cuDNN installed properly.
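A minimal version of that check (tensorflow is only used here as a second opinion on the CUDA setup):

import tensorflow as tf
import torch

print(tf.config.list_physical_devices('GPU'))
print(torch.cuda.is_available())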

Hey @Fabien-Couthouis, thanks for posting this question. We check the actually visible GPUs via ray.get_gpu_ids() to make sure we build the policy on the correct devices. Using config.num_gpus would not work here, because the rollout workers should not use any GPUs (even though their copy of the Policy still has config.num_gpus > 0).
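To illustrate the point (a sketch with example values, not defaults): the learner and the rollout workers share the same config dict, so the Policy cannot rely on config["num_gpus"] alone and has to ask Ray what was actually assigned to the current process:

config = {
    "framework": "torch",
    "num_gpus": 1,             # GPUs reserved for the learner/driver copy of the policy
    "num_workers": 2,          # rollout workers ...
    "num_gpus_per_worker": 0,  # ... which get no GPUs, yet still see num_gpus == 1
}
# On a rollout worker, ray.get_gpu_ids() should be empty (build on CPU);
# on the learner it should contain the assigned GPU ids.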
This may actually be a Ray-on-Windows bug? @sangcho @Alex, any ideas here?

Maybe related to this? [Core] GPU assignment via CUDA_VISIBLE_DEVICES is broken when using placement groups with "ray start --head" · Issue #16614 · ray-project/ray · GitHub. @sven1977, can you confirm that?

Hello,
The issue does not seem to be fixed with this pull request: [Core] GPU assignment via CUDA_VISIBLE_DEVICES is broken when using placement groups with "ray start --head" · Issue #16614 · ray-project/ray · GitHub (i.e. I still have an empty list with ray.get_gpu_ids()).