Error when running on GPU

I am trying to run the following code on a VM with 2 GPUs (Ray 1.4 / Python 3.7.3 / torch 1.8.1).

```python
import logging
import ray

import ray.rllib.agents.ppo as ppo

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

_ = ray.init(ignore_reinit_error=True)

config = ppo.DEFAULT_CONFIG.copy()
config["num_gpus"] = 1
config["num_workers"] = 2
config["framework"] = "torch"

trainer = ppo.PPOTrainer(config=config, env="CartPole-v0")
```

The last part of the stack trace is:

```
~/.conda/envs/py373_cuda/lib/python3.7/site-packages/ray/rllib/policy/torch_policy.py in __init__(self, observation_space, action_space, config, model, loss, action_distribution_class, action_sampler_fn, action_distribution_fn, max_seq_len, get_batch_divisibility_req)
    153                 for i, id_ in enumerate(gpu_ids) if i < config["num_gpus"]
    154             ]
--> 155             self.device = self.devices[0]
    156             ids = [
    157                 id_ for i, id_ in enumerate(gpu_ids) if i < config["num_gpus"]

IndexError: list index out of range
```

This works fine when I set `num_gpus = 0`. Also, the code from this related topic runs just fine on GPUs: "Trials placed on the same GPU on a 2 GPU machine despite `"num_gpus": 1`".

Looking at torch_policy.py, line 956 has `gpu_ids = ray.get_gpu_ids()`. When I run `ray.get_gpu_ids()` myself, I get an empty list.
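The failure can be reproduced with a plain-Python sketch of that selection logic (this mirrors the shape of the code shown in the traceback; it is not RLlib's actual implementation):

```python
# Sketch of RLlib's device-selection logic: if ray.get_gpu_ids() returns
# an empty list, the comprehension yields no devices, and indexing
# devices[0] afterwards raises IndexError -- exactly the error reported.
def select_devices(gpu_ids, num_gpus):
    devices = ["cuda:{}".format(i) for i, id_ in enumerate(gpu_ids)
               if i < num_gpus]
    return devices

print(select_devices(["0", "1"], 1))  # ['cuda:0']
print(select_devices([], 1))          # [] -> devices[0] raises IndexError
```

So the root cause is not CUDA itself but `ray.get_gpu_ids()` coming back empty while `num_gpus > 0`.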

Check out this GitHub issue: [rllib][regression] num_gpus=1 on a device with GPU uses CPU for PyTorch in 0.8.7 · Issue #10271 · ray-project/ray

Also, check whether torch/tensorflow is detecting the GPUs!

torch.cuda.is_available() returns True.

And I am running Ray v1.4. The relevant snippet of code (compared against the GitHub fix) is:

```python
if config["_fake_gpus"] or config["num_gpus"] == 0 or \
        not torch.cuda.is_available():
```
I am not setting `_fake_gpus` in the config (so it stays at its default of False), `num_gpus` is 1, and `not torch.cuda.is_available()` is also False. So the whole condition is False, and control passes to the else branch of that block, where it fails.
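Expressed as a standalone sketch (the shape of the condition is taken from the snippet above; the function name is my own), the branch logic works like this:

```python
# Sketch of the CPU-vs-GPU branch: with _fake_gpus=False, num_gpus=1,
# and CUDA available, every disjunct is False, so execution takes the
# GPU (else) branch -- which then crashes if ray.get_gpu_ids() is empty.
def takes_cpu_branch(fake_gpus, num_gpus, cuda_available):
    return fake_gpus or num_gpus == 0 or not cuda_available

print(takes_cpu_branch(False, 1, True))   # False -> GPU branch
print(takes_cpu_branch(False, 0, True))   # True  -> CPU branch
print(takes_cpu_branch(False, 1, False))  # True  -> CPU branch
```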

I got the same error. This code:

```python
import ray
import ray.rllib.agents.ppo as ppo
import torch

print(torch.cuda.is_available())
ray.init()
config = ppo.DEFAULT_CONFIG
config["num_gpus"] = 1
config["framework"] = "torch"
trainer = ppo.PPOTrainer(config=config, env="CartPole-v0")
```

prints `True` for CUDA and ends with this traceback:

```
  File "/home/bukovskiy/.local/lib/python3.8/site-packages/ray/rllib/policy/torch_policy.py", line 159, in __init__
    self.device = self.devices[0]
IndexError: list index out of range
```

referring to these lines in torch_policy.py:

```python
self.devices = [
    torch.device("cuda:{}".format(i))
    for i, id_ in enumerate(gpu_ids) if i < config["num_gpus"]
]
self.device = self.devices[0]
```

Yes, we should have a better error message for this case:

  • torch.cuda.is_available() returns True, but
  • ray.get_gpu_ids() returns an empty list

Not sure what the solution here is, though. Could this be a ray core issue not detecting the GPUs on some machines?
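One shape such an error message could take (purely hypothetical, not RLlib's actual API or fix) is to fail fast with an explicit explanation instead of a bare IndexError:

```python
# Hypothetical guard: detect the inconsistent state (GPUs requested and
# CUDA available, but Ray assigned none) and raise a descriptive error
# before the device list is ever indexed.
def resolve_devices(gpu_ids, num_gpus, cuda_available):
    if num_gpus > 0 and cuda_available and not gpu_ids:
        raise RuntimeError(
            "num_gpus > 0 and CUDA is available, but ray.get_gpu_ids() "
            "returned no GPUs; was a GPU actually reserved for this "
            "process by Ray?")
    return ["cuda:{}".format(i) for i, _ in enumerate(gpu_ids)
            if i < num_gpus]

print(resolve_devices(["0"], 1, True))  # ['cuda:0']
```

With this guard, the empty-`gpu_ids` case surfaces as a readable RuntimeError rather than `IndexError: list index out of range`.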

I had the same error and worked around it by copying the rllib source locally into my project.

Hey everyone (@ironv, @Guy_Tennenholtz, @Vladimir_Uspenskii, @michaelzhiluo), thanks for this discussion and for surfacing these issues.
We did make lots of improvements on the multi-GPU/GPU frontier recently and a lot of these bugs should be fixed by now in the current master.
We also deployed nightly 2-GPU learning tests for all major algos and both tf and torch. We’ll add LSTM=True 2-GPU tests for all RNN-supporting algos in the next 1-2 weeks as well.

I also encountered this problem and don't know how to solve it. Has anyone found a solution?