Can RLlib use the GPU accelerator?

I run RLlib programs on a PC with a GTX 1060 GPU, and I have noticed the following line in the terminal output:

Resources requested: 10.0/10 CPUs, 0.9999999999999999/1 GPUs, 0.0/18.28 GiB heap, 0.0/9.14 GiB objects (0.0/1.0 accelerator_type:G)

Apparently, the GPU’s accelerator is not used.

My question is: how can I use the GPU’s accelerator?

+++++++++++++++++++++++++++++++++++++++++++++++++++++
Edit:

I use Tune and RLlib to find the optimal hyperparameters for an RL agent. I pass num_gpus=1 to ray.init() and set "num_gpus": 1 in the config dict.
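
For reference, here is a minimal sketch of how I launch the run (the environment name, worker count, and stopping criterion below are placeholders, not my exact setup):

import ray
from ray import tune

# Placeholder sketch of the launch script, not the exact code.
ray.init(num_gpus=1)

tune.run(
    "DQN",
    config={
        "env": "CartPole-v0",   # placeholder environment
        "framework": "torch",   # placeholder; could also be "tf"
        "num_gpus": 1,          # GPU for the trainer/learner process
        "num_workers": 9,       # CPU-only rollout workers
    },
    stop={"timesteps_total": 100000},
)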

When I run the command nvidia-smi in the terminal, I get:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:1C:00.0 Off |                  N/A |
|  0%   53C    P2    29W / 120W |   4889MiB /  6070MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       979      G   /usr/lib/xorg/Xorg                146MiB |
|    0   N/A  N/A      1145      G   /usr/bin/gnome-shell                5MiB |
|    0   N/A  N/A   1071629      C   ray::DQN.train_buffered()         473MiB |
|    0   N/A  N/A   1071630      C   ray::DQN.train_buffered()         473MiB |
|    0   N/A  N/A   1071631      C   ray::DQN.train_buffered()         473MiB |
|    0   N/A  N/A   1071632      C   ray::DQN.train_buffered()         473MiB |
|    0   N/A  N/A   1071633      C   ray::DQN.train_buffered()         473MiB |
|    0   N/A  N/A   1071634      C   ray::DQN.train_buffered()         473MiB |
|    0   N/A  N/A   1071635      C   ray::DQN.train_buffered()         473MiB |
|    0   N/A  N/A   1071636      C   ray::DQN.train_buffered()         473MiB |
|    0   N/A  N/A   1071637      C   ray::DQN.train_buffered()         473MiB |
|    0   N/A  N/A   1071638      C   ray::DQN.train_buffered()         473MiB |
+-----------------------------------------------------------------------------+

So the RLlib program is clearly using the GPU to train the RL agent.

However, I suspect the program is not using the GPU’s accelerator, because the terminal reports

0.0/18.28 GiB heap, 0.0/9.14 GiB objects (0.0/1.0 accelerator_type:G)

during training.

My goal is to have the RLlib program use the GPU’s accelerator to (hopefully) speed up training.

Did you specify "num_gpus": 1 in the config supplied to the trainer?

@bmanczak Yes, I did.

Did you also add num_gpus=1 to ray.init()?

Have you checked that the tensor library is working correctly with GPU support?
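
For example, a minimal check (assuming you are on the PyTorch framework; the TensorFlow check is analogous):

import torch

# Should print True and the name of the GTX 1060 if PyTorch sees the GPU.
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))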

@mannyv Yes, I also did that.

Does it start running?

If so, it looks like it is registering your request to use 1 GPU.

Have you run nvidia-smi to see if the GPU is used during the run?

Yes.

I didn’t describe my situation clearly in the original post; sorry about that.

I use Tune and RLlib to find the optimal hyperparameters for an RL agent. I pass num_gpus=1 to ray.init() and set "num_gpus": 1 in the config dict.

While the RL agent is training, the terminal outputs the following:

Resources requested: 10.0/10 CPUs, 0.9999999999999999/1 GPUs, 0.0/18.28 GiB heap, 0.0/9.14 GiB objects (0.0/1.0 accelerator_type:G)

Furthermore, running nvidia-smi in the terminal gives the same output as shown in the edit to the original post above.

So the RLlib program is using the GPU to train the RL agent, but it does not seem to be using the GPU’s accelerator.

My goal is to have the RLlib program use the GPU’s accelerator to (hopefully) speed up training.

I will add the above info to the original post.

@Roller44,

It is already using the GPU as an accelerator. The accelerator_type setting is a way of specifying what type of resource an actor or task should run on. For example, say you had two nodes with different types of GPUs: you could use the accelerator type to require that a task run only on the node that has accelerator_type=x.

https://docs.ray.io/en/latest/advanced.html?highlight=accelerator_type#accelerator-types
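
As a sketch, here is how you could pin a task to a particular GPU model (the Tesla V100 constant below is just an illustration; on a single machine with one GPU this constraint adds nothing):

import ray
from ray.util.accelerators import NVIDIA_TESLA_V100

# Hypothetical example: only schedule this task on a node whose GPU was
# detected as a Tesla V100. Useful in a multi-node cluster with mixed
# GPU models, not on a single-GPU workstation.
@ray.remote(num_gpus=1, accelerator_type=NVIDIA_TESLA_V100)
def train_step():
    pass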
