Error when trying to use GPUs during RL training

I am in the final stages of a project I've been working on in RLlib for a while now. When I try to train my model on the GPU (via the Tune API with config["num_gpus"] = 1), I can't get it to run without throwing errors.
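Roughly, my training call looks like this (a simplified sketch; my real environment and model are custom, so the env name below is just a placeholder):

import ray
from ray import tune

ray.init()
tune.run(
    "PPO",
    config={
        "env": "CartPole-v0",   # placeholder for my custom env
        "framework": "torch",
        "num_workers": 1,
        "num_gpus": 1,          # this is the setting that triggers the error
    },
    stop={"training_iteration": 1},
)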

Specifically, when I try to train my agent, an error is thrown from here (Line 157), essentially telling me that len(self.devices) is 0 and that no GPUs are being detected.

Initially I thought it was because my GPU was not set up to work with PyTorch (the framework I am using for this project), but after running a simple test with torch.cuda.is_available(), torch.cuda.device(0), and torch.cuda.get_device_name(0), I can see that my GPU is recognized by Torch (an RTX 2060 Max-Q, for reference).
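For completeness, this is essentially the check I ran (device index 0 assumed, since it's the only GPU in the machine):

import torch

# Sanity check that PyTorch itself can see the card
print(torch.cuda.is_available())      # True
print(torch.cuda.device(0))           # <torch.cuda.device object ...>
print(torch.cuda.get_device_name(0))  # reports the RTX 2060 Max-Q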

Has anyone encountered this error before, and are there any workarounds for it? I saw someone suggest removing config["num_gpus"] = 1 here from the tune.run config call, but that just seems to make the PyTorch policies run on my CPU (where they train properly), which is not what I want.

Thanks for your help.

For reference, here is the error:

ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::PPO.__init__() (pid=5544, ip=10.0.0.37)
  File "python\ray\_raylet.pyx", line 501, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 451, in ray._raylet.execute_task.function_executor
  File "C:\Users\408aa\Anaconda3\envs\rl_env\lib\site-packages\ray\_private\function_manager.py", line 563, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "C:\Users\408aa\Anaconda3\envs\rl_env\lib\site-packages\ray\rllib\agents\trainer_template.py", line 123, in __init__
    Trainer.__init__(self, config, env, logger_creator)
  File "C:\Users\408aa\Anaconda3\envs\rl_env\lib\site-packages\ray\rllib\agents\trainer.py", line 548, in __init__
    super().__init__(config, logger_creator)
  File "C:\Users\408aa\Anaconda3\envs\rl_env\lib\site-packages\ray\tune\trainable.py", line 98, in __init__
    self.setup(copy.deepcopy(self.config))
  File "C:\Users\408aa\Anaconda3\envs\rl_env\lib\site-packages\ray\rllib\agents\trainer.py", line 709, in setup
    self._init(self.config, self.env_creator)
  File "C:\Users\408aa\Anaconda3\envs\rl_env\lib\site-packages\ray\rllib\agents\trainer_template.py", line 155, in _init
    num_workers=self.config["num_workers"])
  File "C:\Users\408aa\Anaconda3\envs\rl_env\lib\site-packages\ray\rllib\agents\trainer.py", line 797, in _make_workers
    logdir=self.logdir)
  File "C:\Users\408aa\Anaconda3\envs\rl_env\lib\site-packages\ray\rllib\evaluation\worker_set.py", line 83, in __init__
    lambda p, pid: (pid, p.observation_space, p.action_space)))
  File "C:\Users\408aa\Anaconda3\envs\rl_env\lib\site-packages\ray\_private\client_mode_hook.py", line 62, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\408aa\Anaconda3\envs\rl_env\lib\site-packages\ray\worker.py", line 1497, in get
    raise value
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::RolloutWorker.__init__() (pid=11476, ip=10.0.0.37)
  File "python\ray\_raylet.pyx", line 501, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 451, in ray._raylet.execute_task.function_executor
  File "C:\Users\408aa\Anaconda3\envs\rl_env\lib\site-packages\ray\_private\function_manager.py", line 563, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "C:\Users\408aa\Anaconda3\envs\rl_env\lib\site-packages\ray\rllib\evaluation\rollout_worker.py", line 537, in __init__
    policy_dict, policy_config)
  File "C:\Users\408aa\Anaconda3\envs\rl_env\lib\site-packages\ray\rllib\evaluation\rollout_worker.py", line 1196, in _build_policy_map
    policy_map[name] = cls(obs_space, act_space, merged_conf)
  File "C:\Users\408aa\Anaconda3\envs\rl_env\lib\site-packages\ray\rllib\policy\policy_template.py", line 267, in __init__
    get_batch_divisibility_req=get_batch_divisibility_req,
  File "C:\Users\408aa\Anaconda3\envs\rl_env\lib\site-packages\ray\rllib\policy\torch_policy.py", line 155, in __init__
    self.device = self.devices[0]
IndexError: list index out of range

Hey @cl_tch, could you give us more information about your machine and package versions?
ray.get_gpu_ids() probably returns an empty list (which we should catch better).
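Could you also check what Ray itself detects at startup? Something like this (just a sketch) would help narrow it down:

import ray

ray.init()
# If Ray's GPU autodetection failed, there will be no "GPU" entry here.
print(ray.cluster_resources())
# GPUs assigned to the current (driver) process.
print(ray.get_gpu_ids())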

Sure, my machine has an AMD Ryzen 4900HS and an RTX 2060 Max-Q (which, again, is detected by Torch and TensorFlow just fine) and is running Windows 10. ray.get_gpu_ids() is indeed returning an empty list, but I definitely have a CUDA-compatible GPU as mentioned above, so I'm not sure why Ray isn't detecting it.

As for package versions, I am running in an Anaconda virtual environment with Python 3.7.10, ray 1.4.1, numpy 1.21.0 (which also throws a lot of deprecation/runtime warnings from within the RLlib files; just thought I'd mention it), torch 1.9.0, and tensorflow 1.15.0 (although my custom model uses torch, so the TF version should be a non-factor).

Thanks for your help!

@sven1977 Do you know of any workarounds to get the GPU to work with Torch policies? I have seen quite a few GitHub issues on this, but none of them seem to reach a viable solution (the ones that work all seem to resort to training the policy on the CPU).
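One thing I've been meaning to try (just a guess on my end, not a confirmed fix) is declaring the GPU explicitly instead of relying on Ray's autodetection:

import ray

# Tell Ray about the GPU up front rather than letting it autodetect
# (only an idea I want to test, not something I've verified resolves this).
ray.init(num_gpus=1)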