Error when running on GPU

I am trying to run the following code on a VM with 2 GPUs (Ray 1.4 / Python 3.7.3 / torch 1.8.1).

```python
import logging
import ray

import ray.rllib.agents.ppo as ppo

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

_ = ray.init(ignore_reinit_error=True)

config = ppo.DEFAULT_CONFIG.copy()
config["num_gpus"] = 1
config["num_workers"] = 2
config["framework"] = "torch"

trainer = ppo.PPOTrainer(config=config, env="CartPole-v0")
```

The last part of the stack trace is:

```
~/.conda/envs/py373_cuda/lib/python3.7/site-packages/ray/rllib/policy/torch_policy.py in __init__(self, observation_space, action_space, config, model, loss, action_distribution_class, action_sampler_fn, action_distribution_fn, max_seq_len, get_batch_divisibility_req)
    153                 for i, id_ in enumerate(gpu_ids) if i < config["num_gpus"]
    154             ]
--> 155             self.device = self.devices[0]
    156             ids = [
    157                 id_ for i, id_ in enumerate(gpu_ids) if i < config["num_gpus"]

IndexError: list index out of range
```

This works fine when I set `num_gpus = 0`. Also, the code from this related topic runs just fine on GPUs: "Trials placed on the same GPU on a 2 GPU machine despite `"num_gpus": 1`".

Looking at torch_policy.py, line 956 has `gpu_ids = ray.get_gpu_ids()`. When I run `ray.get_gpu_ids()` myself, I get an empty list.
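The failure can be reproduced with a plain-Python sketch of that selection logic (this mirrors the shape of the code shown in the traceback; it is not RLlib's actual implementation):

```python
# Sketch of RLlib's device-selection logic: if ray.get_gpu_ids() returns
# an empty list, the comprehension yields no devices, and indexing
# devices[0] afterwards raises IndexError -- exactly the error reported.
def select_devices(gpu_ids, num_gpus):
    devices = ["cuda:{}".format(i) for i, id_ in enumerate(gpu_ids)
               if i < num_gpus]
    return devices

print(select_devices(["0", "1"], 1))  # ['cuda:0']
print(select_devices([], 1))          # [] -> devices[0] raises IndexError
```

So the root cause is not CUDA itself but `ray.get_gpu_ids()` coming back empty while `num_gpus > 0`.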

Check out this GitHub issue: [rllib][regression] num_gpus=1 on a device with GPU uses CPU for PyTorch in 0.8.7 · Issue #10271 · ray-project/ray

Also, check whether torch/tensorflow is detecting the GPUs!

torch.cuda.is_available() returns True.

And I am running Ray v1.4. The relevant snippet of code (compared against the GitHub fix) is:

```python
if config["_fake_gpus"] or config["num_gpus"] == 0 or \
        not torch.cuda.is_available():
```
I am not setting `_fake_gpus` in the config (so it stays at its default of False), `num_gpus` is 1, and `not torch.cuda.is_available()` is also False. So the whole condition is False, and control passes to the else branch of that block, where it fails.
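Expressed as a standalone sketch (the shape of the condition is taken from the snippet above; the function name is my own), the branch logic works like this:

```python
# Sketch of the CPU-vs-GPU branch: with _fake_gpus=False, num_gpus=1,
# and CUDA available, every disjunct is False, so execution takes the
# GPU (else) branch -- which then crashes if ray.get_gpu_ids() is empty.
def takes_cpu_branch(fake_gpus, num_gpus, cuda_available):
    return fake_gpus or num_gpus == 0 or not cuda_available

print(takes_cpu_branch(False, 1, True))   # False -> GPU branch
print(takes_cpu_branch(False, 0, True))   # True  -> CPU branch
print(takes_cpu_branch(False, 1, False))  # True  -> CPU branch
```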

I got the same error. This code:

```python
import ray
import ray.rllib.agents.ppo as ppo
import torch

print(torch.cuda.is_available())
ray.init()
config = ppo.DEFAULT_CONFIG
config["num_gpus"] = 1
config["framework"] = "torch"
trainer = ppo.PPOTrainer(config=config, env="CartPole-v0")
```

prints `True` for CUDA and ends with this traceback:

```
  File "/home/bukovskiy/.local/lib/python3.8/site-packages/ray/rllib/policy/torch_policy.py", line 159, in __init__
    self.device = self.devices[0]
IndexError: list index out of range
```

referring to these lines in torch_policy.py:

```python
self.devices = [
    torch.device("cuda:{}".format(i))
    for i, id_ in enumerate(gpu_ids) if i < config["num_gpus"]
]
self.device = self.devices[0]
```

Yes, we should have a better error message for this case:

  • torch.cuda.is_available() returns True, but
  • ray.get_gpu_ids() returns an empty list

Not sure what the solution here is, though. Could this be a ray core issue not detecting the GPUs on some machines?
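One shape such an error message could take (purely hypothetical, not RLlib's actual API or fix) is to fail fast with an explicit explanation instead of a bare IndexError:

```python
# Hypothetical guard: detect the inconsistent state (GPUs requested and
# CUDA available, but Ray assigned none) and raise a descriptive error
# before the device list is ever indexed.
def resolve_devices(gpu_ids, num_gpus, cuda_available):
    if num_gpus > 0 and cuda_available and not gpu_ids:
        raise RuntimeError(
            "num_gpus > 0 and CUDA is available, but ray.get_gpu_ids() "
            "returned no GPUs; was a GPU actually reserved for this "
            "process by Ray?")
    return ["cuda:{}".format(i) for i, _ in enumerate(gpu_ids)
            if i < num_gpus]

print(resolve_devices(["0"], 1, True))  # ['cuda:0']
```

With this guard, the empty-`gpu_ids` case surfaces as a readable RuntimeError rather than `IndexError: list index out of range`.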

I had the same error and worked around it by copying the rllib source locally into my project.

Hey everyone (@ironv, @Guy_Tennenholtz, @Vladimir_Uspenskii, @michaelzhiluo), thanks for this discussion and for surfacing these issues.
We did make lots of improvements on the multi-GPU/GPU frontier recently and a lot of these bugs should be fixed by now in the current master.
We also deployed nightly 2-GPU learning tests for all major algos and both tf and torch. We’ll add LSTM=True 2-GPU tests for all RNN-supporting algos in the next 1-2 weeks as well.

I also encountered this problem and don't know how to solve it. Has anyone found a solution?