IMPALA does not respect GPU allocation

It seems the trainer processes (e.g. IMPALA) do not really respect num_gpus when it comes to resource allocation. This is likely related to RLlib workers ignoring GPU restrictions. With a config like this:

"num_workers": 3,
        "num_gpus_per_worker": 1,
        "num_gpus": 1,

You would expect each of the three rollout workers to get one of GPUs 0-2 and the trainer process to get GPU 3. However, look at the memory usage reported by nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:18:00.0 Off |                  N/A |
| 41%   76C    P2   130W / 250W |  10348MiB / 11019MiB |     85%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  On   | 00000000:3B:00.0 Off |                  N/A |
| 30%   46C    P2    67W / 250W |   4463MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  On   | 00000000:5E:00.0 Off |                  N/A |
| 27%   37C    P2    55W / 250W |   4463MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  On   | 00000000:86:00.0 Off |                  N/A |
| 28%   36C    P8     1W / 250W |      3MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

As you can see, the trainer ends up on the same GPU as one of the workers (GPU 0), even though each worker should take an entire GPU, while GPU 3 sits idle. So instead of running 6 environments, I’m limited to 3.
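One way to confirm which GPU Ray actually hands to each process is to ask the workers directly. This is only a sketch, assuming a trainer object built from the config above; report_gpus is an illustrative helper, not part of RLlib:

import os
import ray

def report_gpus(worker):
    # Runs on the local (trainer-side) worker and on each remote rollout worker.
    return {
        "pid": os.getpid(),
        "ray_gpu_ids": ray.get_gpu_ids(),
        "cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES"),
    }

print(trainer.workers.foreach_worker(report_gpus))

With the config above I would expect to see one distinct GPU per rollout worker, plus a separate one for the trainer.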

Hey @smorad, thanks for filing this. Taking a look right now. …

Btw, is this torch or tf?

Running a simple CartPole training job on a 4-GPU machine:

rllib train --run PPO --env CartPole-v0 --config='{"num_gpus_per_worker": 1, "num_gpus": 1, "num_workers": 2, "framework": "torch"}'

The learner seems to use a little more GPU than the workers, but I do see the two workers also consume some of the GPU at times (see the second nvidia-smi output).

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1B.0 Off |                    0 |
| N/A   40C    P0    51W / 300W |   1248MiB / 16160MiB |      9%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:1C.0 Off |                    0 |
| N/A   39C    P0    54W / 300W |   1238MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:1D.0 Off |                    0 |
| N/A   42C    P0    57W / 300W |   1238MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   37C    P0    38W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
(base) ray@ip-172-31-83-50:~/AttentionNetVizdoomBEnchmarks/ray$ nvidia-smi
Fri Feb 26 04:05:58 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1B.0 Off |                    0 |
| N/A   40C    P0    50W / 300W |   1248MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:1C.0 Off |                    0 |
| N/A   39C    P0    54W / 300W |   1238MiB / 16160MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:1D.0 Off |                    0 |
| N/A   42C    P0    57W / 300W |   1238MiB / 16160MiB |      5%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   37C    P0    38W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

This is torch, using the nightly wheel. In this simple case the learner's overhead is small, but I think with larger observation spaces, custom models, learner queues, etc., the memory usage blows up. Here is my config (the custom model is loaded later):

{
    "framework": "torch",
    "model": {
        "framestack": false
    },
    "num_workers": 3,
    "num_gpus_per_worker": 1,
    "num_gpus": 1,
    "rollout_fragment_length": 256,
    "train_batch_size": 1024,
    "lr": 0.0001
}
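For context, I'm not going through Tune here; I build the trainer directly and step it myself, roughly like this (a sketch; MyCustomModel and MyGpuEnv are placeholders for my actual model class and env):

import ray
from ray.rllib.agents.impala import ImpalaTrainer
from ray.rllib.models import ModelCatalog

from my_project import MyCustomModel, MyGpuEnv  # placeholders

ray.init()
ModelCatalog.register_custom_model("my_model", MyCustomModel)

config = {
    "framework": "torch",
    "model": {"framestack": False, "custom_model": "my_model"},
    "num_workers": 3, "num_gpus_per_worker": 1, "num_gpus": 1,
    "rollout_fragment_length": 256, "train_batch_size": 1024, "lr": 0.0001,
}

trainer = ImpalaTrainer(config=config, env=MyGpuEnv)
for _ in range(100):
    print(trainer.train()["episode_reward_mean"])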

I am using an env that requires a GPU, so each rollout worker really does need one.
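Inside a rollout worker, the env can find the GPU that Ray allocated to that worker. Illustrative sketch only, not my actual env code:

import ray
import torch

class GpuEnvMixin:
    """Illustrative: how an env created inside a rollout worker can pick up
    the GPU Ray allocated to that worker."""

    def _pick_device(self):
        gpu_ids = ray.get_gpu_ids()  # GPU IDs Ray assigned to this worker process
        # With num_gpus_per_worker=1, Ray sets CUDA_VISIBLE_DEVICES to that one
        # GPU, so "cuda:0" inside the worker maps to the allocated card.
        return torch.device("cuda:0" if gpu_ids else "cpu")

So if the learner lands on the same physical card as one of these workers, its memory comes straight out of the env's budget.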

This behavior seems to go away when using tune.run instead of trainer.train(). It may be worth suggesting tune to RLlib users in the documentation, as the progress reporting, autosaving, etc. seem quite a bit nicer than driving trainer.train() yourself. I’ve been missing out! A sketch of the tune.run call is below, followed by nvidia-smi output from a different config with 18 workers:
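Roughly what the tune.run call looks like (the env id, stopping condition, and per-worker GPU fraction are placeholders, not the exact values from my run):

import ray
from ray import tune

ray.init()
tune.run(
    "IMPALA",
    config={
        "env": "my_gpu_env",        # placeholder env id
        "framework": "torch",
        "num_workers": 18,
        # Fractional GPUs so 18 workers plus the learner fit on 4 cards
        # (the exact fraction is a placeholder).
        "num_gpus_per_worker": 0.15,
        "num_gpus": 1,
    },
    stop={"timesteps_total": 1_000_000},  # placeholder stopping condition
    checkpoint_freq=10,
)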

Fri Feb 26 18:22:42 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:18:00.0 Off |                  N/A |
| 27%   34C    P8    19W / 250W |   3788MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  On   | 00000000:3B:00.0 Off |                  N/A |
| 28%   36C    P8    20W / 250W |   8485MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  On   | 00000000:5E:00.0 Off |                  N/A |
| 27%   32C    P8     6W / 250W |   8485MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  On   | 00000000:86:00.0 Off |                  N/A |
| 27%   34C    P8     1W / 250W |   8485MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+