Can RLlib use the GPU accelerator?

I run RLlib programs on a PC with a GTX 1060 GPU, and I have noticed the following line in the terminal output:

Resources requested: 10.0/10 CPUs, 0.9999999999999999/1 GPUs, 0.0/18.28 GiB heap, 0.0/9.14 GiB objects (0.0/1.0 accelerator_type:G)

Apparently, the GPU’s accelerator is not used.

My question is: how can I use the GPU’s accelerator?

+++++++++++++++++++++++++++++++++++++++++++++++++++++
Edit:

I use Tune and RLlib to find the optimal hyperparameters for an RL agent. I pass num_gpus=1 to ray.init() and set "num_gpus": 1 in the config dict.
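
For reference, here is a minimal sketch of how I launch the run (the environment name, worker count, and stopping criterion below are placeholders, not my exact setup):

import ray
from ray import tune

# Placeholder sketch of the launch script, not the exact code.
ray.init(num_gpus=1)

tune.run(
    "DQN",
    config={
        "env": "CartPole-v0",   # placeholder environment
        "framework": "torch",   # placeholder; could also be "tf"
        "num_gpus": 1,          # GPU for the trainer/learner process
        "num_workers": 9,       # CPU-only rollout workers
    },
    stop={"timesteps_total": 100000},
)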

When I run the command nvidia-smi in the terminal, I get:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:1C:00.0 Off |                  N/A |
|  0%   53C    P2    29W / 120W |   4889MiB /  6070MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       979      G   /usr/lib/xorg/Xorg                146MiB |
|    0   N/A  N/A      1145      G   /usr/bin/gnome-shell                5MiB |
|    0   N/A  N/A   1071629      C   ray::DQN.train_buffered()         473MiB |
|    0   N/A  N/A   1071630      C   ray::DQN.train_buffered()         473MiB |
|    0   N/A  N/A   1071631      C   ray::DQN.train_buffered()         473MiB |
|    0   N/A  N/A   1071632      C   ray::DQN.train_buffered()         473MiB |
|    0   N/A  N/A   1071633      C   ray::DQN.train_buffered()         473MiB |
|    0   N/A  N/A   1071634      C   ray::DQN.train_buffered()         473MiB |
|    0   N/A  N/A   1071635      C   ray::DQN.train_buffered()         473MiB |
|    0   N/A  N/A   1071636      C   ray::DQN.train_buffered()         473MiB |
|    0   N/A  N/A   1071637      C   ray::DQN.train_buffered()         473MiB |
|    0   N/A  N/A   1071638      C   ray::DQN.train_buffered()         473MiB |
+-----------------------------------------------------------------------------+

So the RLlib program is clearly using the GPU to train the RL agent.

However, I suspect the program is not using the GPU’s accelerator, because the terminal reports

0.0/18.28 GiB heap, 0.0/9.14 GiB objects (0.0/1.0 accelerator_type:G)

during training.

My goal is to have the RLlib program use the GPU’s accelerator to (hopefully) speed up training.

Did you specify "num_gpus": 1 in the config supplied to the trainer?

@bmanczak Yes, I did.

Did you also add num_gpus=1 to ray.init()?

Have you checked that the tensor library is working correctly with GPU support?
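
For example, a minimal check (assuming you are on the PyTorch framework; the TensorFlow check is analogous):

import torch

# Should print True and the name of the GTX 1060 if PyTorch sees the GPU.
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))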

@mannyv Yes, I also did that.

Does it start running?

If so, it looks like it is registering your request to use 1 GPU.

Have you run nvidia-smi to see if the GPU is used during the run?

Yes.

I didn’t describe my situation clearly in the original post; sorry about that.

I use Tune and RLlib to find the optimal hyperparameters for an RL agent. I pass num_gpus=1 to ray.init() and set "num_gpus": 1 in the config dict.

While the RL agent is training, the terminal outputs the following:

Resources requested: 10.0/10 CPUs, 0.9999999999999999/1 GPUs, 0.0/18.28 GiB heap, 0.0/9.14 GiB objects (0.0/1.0 accelerator_type:G)

Furthermore, running nvidia-smi in the terminal gives the same output as shown in the edit to the original post above.

So the RLlib program is using the GPU to train the RL agent, but it does not seem to be using the GPU’s accelerator.

My goal is to have the RLlib program use the GPU’s accelerator to (hopefully) speed up training.

I will add the above info to the original post.

@Roller44,

It is already using the GPU as an accelerator. The accelerator_type setting is a way of specifying what type of resource an actor or task should run on. For example, say you had two nodes with different types of GPUs: you could use the accelerator type to require that a task run only on the node that has accelerator_type=x.

https://docs.ray.io/en/latest/advanced.html?highlight=accelerator_type#accelerator-types
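
As a sketch, here is how you could pin a task to a particular GPU model (the Tesla V100 constant below is just an illustration; on a single machine with one GPU this constraint adds nothing):

import ray
from ray.util.accelerators import NVIDIA_TESLA_V100

# Hypothetical example: only schedule this task on a node whose GPU was
# detected as a Tesla V100. Useful in a multi-node cluster with mixed
# GPU models, not on a single-GPU workstation.
@ray.remote(num_gpus=1, accelerator_type=NVIDIA_TESLA_V100)
def train_step():
    pass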
