Error with torch policy and ray.get_gpu_ids on Windows

Fabien-Couthouis · June 30, 2021, 8:13am

Hello,

I have an error on Windows when using PPO with Pytorch:

...\envs\rllib-pt\lib\site-packages\ray\rllib\policy\torch_policy.py", line 155, in __init__
(pid=18860)     self.device = self.devices[0]
(pid=18860) IndexError: list index out of range

I verified that torch.cuda.is_available() returns True. The error occurs because ray.get_gpu_ids() returns an empty list, even when num_gpus is set in ray.init(). Interestingly, ray.resource_spec._autodetect_num_gpus() returns 1, so a fix could build on that, but maybe there is another solution. I am using ray==1.4.

Best,

michaelzhiluo · July 1, 2021, 7:19am

Try downgrading to Ray==1.2, lmk if it still doesn’t work!

Fabien-Couthouis · July 1, 2021, 8:38am

Hi @michaelzhiluo , thanks for the answer!
However it still does not work with ray==1.2.0.

Vladimir_Uspenskii · July 1, 2021, 8:48pm

I got the same error on Ubuntu. This code:

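import ray
import ray.rllib.agents.ppo as ppo
import torch

# Confirm that PyTorch itself can see the GPU.
print(torch.cuda.is_available())
ray.init()
# Copy the defaults so the module-level DEFAULT_CONFIG is not mutated.
config = ppo.DEFAULT_CONFIG.copy()
config["num_gpus"] = 1
config["framework"] = "torch"
# Building the trainer is what raises the IndexError.
trainer = ppo.PPOTrainer(config=config, env="CartPole-v0")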

prints “True” for CUDA and ends with this traceback:

 File "/home/bukovskiy/.local/lib/python3.8/site-packages/ray/rllib/policy/torch_policy.py", line 159, in __init__
    self.device = self.devices[0]
IndexError: list index out of range

referring to these lines in torch_policy.py:

self.devices = [
    torch.device("cuda:{}".format(i))
    for i, id_ in enumerate(gpu_ids) if i < config["num_gpus"]
]
self.device = self.devices[0]
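
To make the failure mode concrete, here is a minimal sketch that mirrors the quoted list comprehension with plain strings instead of torch.device objects, so it runs without Ray or PyTorch. The helper name build_device_names is hypothetical and not part of RLlib; the empty gpu_ids list stands in for what ray.get_gpu_ids() is reported to return in this thread.

```python
def build_device_names(gpu_ids, num_gpus):
    """Mirror the quoted comprehension, using strings in place of torch.device."""
    return [
        "cuda:{}".format(i)
        for i, id_ in enumerate(gpu_ids) if i < num_gpus
    ]


# One GPU id assigned to the process: one device name is produced.
devices_ok = build_device_names(gpu_ids=[0], num_gpus=1)

# Empty gpu_ids (the situation reported above): the result is empty
# regardless of num_gpus, because the comprehension iterates over gpu_ids.
devices_empty = build_device_names(gpu_ids=[], num_gpus=1)

# Indexing the empty list reproduces the error from the traceback.
try:
    device = devices_empty[0]
except IndexError as exc:
    error_message = str(exc)

print(devices_ok, devices_empty, error_message)
```

So num_gpus only caps the number of devices; it cannot create one when gpu_ids is empty.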

Fabien-Couthouis · July 2, 2021, 12:53pm

Thanks for the test on Ubuntu and the reproduction code!
I am not sure why we need to call ray.get_gpu_ids as the only info we need is the number of gpus set in the config. What do you think @sven1977?

Fabien-Couthouis · July 13, 2021, 11:15am

Any update on this? @michaelzhiluo

michaelzhiluo · July 14, 2021, 6:18am

Try running import tensorflow as tf and check whether tf.config.list_physical_devices('GPU') returns a list of GPUs. Make sure you have CUDA and cuDNN installed properly.

sven1977 · July 14, 2021, 3:53pm

Hey @Fabien-Couthouis , thanks for posting this question. We check the actually visible GPUs via ray.get_gpu_ids to make sure we build the correct policy. Using config.num_gpus would not work here as on the rollout workers no GPUs should be used (but their copy of the Policy still has config.num_gpus>0).
This may actually be a ray on Win bug? @sangcho @Alex , any ideas here?
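
The distinction between configured and visible GPUs can be sketched as follows. This is an illustration under assumptions, not RLlib's actual implementation: choose_device_names is a hypothetical helper, and device names are plain strings so the snippet runs without Ray or PyTorch.

```python
def choose_device_names(visible_gpu_ids, num_gpus):
    """Pick devices from the GPUs visible to this process.

    visible_gpu_ids plays the role of ray.get_gpu_ids(); num_gpus plays
    the role of config["num_gpus"]. Falls back to CPU if nothing is visible.
    """
    if num_gpus <= 0 or not visible_gpu_ids:
        return ["cpu"]
    return [
        "cuda:{}".format(i)
        for i, _ in enumerate(visible_gpu_ids) if i < num_gpus
    ]


# Learner process: GPU requested and one GPU visible.
learner_devices = choose_device_names(visible_gpu_ids=[0], num_gpus=1)

# Rollout worker: same config (num_gpus=1), but no GPU assigned to it.
worker_devices = choose_device_names(visible_gpu_ids=[], num_gpus=1)

print(learner_devices, worker_devices)
```

Note that a CPU fallback like this would only hide the problem reported in this thread: if the process that should own the GPU gets an empty list, it would silently train on CPU instead of raising.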

sangcho · July 15, 2021, 5:33pm

Maybe related to this? [Core] GPU assignment via CUDA_VISIBLE_DEVICES is broken when using placement groups with "ray start --head" · Issue #16614 · ray-project/ray · GitHub @sven1977 can you confirm that?

Fabien-Couthouis · July 30, 2021, 8:41am

Hello,
The issue does not seem to be fixed with this pull request: [Core] GPU assignment via CUDA_VISIBLE_DEVICES is broken when using placement groups with "ray start --head" · Issue #16614 · ray-project/ray · GitHub (i.e. ray.get_gpu_ids() still returns an empty list).
