IMPALA does not respect GPU allocation

It seems the trainer processes (e.g. IMPALA) do not really respect num_gpus when it comes to resource allocation. This is likely related to RLlib workers ignoring GPU restrictions. With a config like this:

"num_workers": 3,
        "num_gpus_per_worker": 1,
        "num_gpus": 1,

You would expect each of the three rollout workers to get one of GPUs 0-2 and the trainer process to get GPU 3. However, look at the memory usage reported by nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:18:00.0 Off |                  N/A |
| 41%   76C    P2   130W / 250W |  10348MiB / 11019MiB |     85%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  On   | 00000000:3B:00.0 Off |                  N/A |
| 30%   46C    P2    67W / 250W |   4463MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  On   | 00000000:5E:00.0 Off |                  N/A |
| 27%   37C    P2    55W / 250W |   4463MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  On   | 00000000:86:00.0 Off |                  N/A |
| 28%   36C    P8     1W / 250W |      3MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

As you can see, the trainer ends up on the same GPU as one of the workers (GPU 0), even though each worker should take an entire GPU, while GPU 3 sits idle. So instead of running 6 environments, I’m limited to 3.
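One way to confirm which GPU Ray actually hands to each process is to ask the workers directly. This is only a sketch, assuming a trainer object built from the config above; report_gpus is an illustrative helper, not part of RLlib:

import os
import ray

def report_gpus(worker):
    # Runs on the local (trainer-side) worker and on each remote rollout worker.
    return {
        "pid": os.getpid(),
        "ray_gpu_ids": ray.get_gpu_ids(),
        "cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES"),
    }

print(trainer.workers.foreach_worker(report_gpus))

With the config above I would expect to see one distinct GPU per rollout worker, plus a separate one for the trainer.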

Hey @smorad, thanks for filing this. Taking a look right now. …

Btw, is this torch or tf?

Running a simple CartPole training job on a 4-GPU machine:

rllib train --run PPO --env CartPole-v0 --config='{"num_gpus_per_worker": 1, "num_gpus": 1, "num_workers": 2, "framework": "torch"}'

The learner seems to use a little more GPU than the workers, but I do see the two workers also consume some of the GPU at times (see the second nvidia-smi output).

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1B.0 Off |                    0 |
| N/A   40C    P0    51W / 300W |   1248MiB / 16160MiB |      9%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:1C.0 Off |                    0 |
| N/A   39C    P0    54W / 300W |   1238MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:1D.0 Off |                    0 |
| N/A   42C    P0    57W / 300W |   1238MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   37C    P0    38W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
(base) ray@ip-172-31-83-50:~/AttentionNetVizdoomBEnchmarks/ray$ nvidia-smi
Fri Feb 26 04:05:58 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1B.0 Off |                    0 |
| N/A   40C    P0    50W / 300W |   1248MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:1C.0 Off |                    0 |
| N/A   39C    P0    54W / 300W |   1238MiB / 16160MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:1D.0 Off |                    0 |
| N/A   42C    P0    57W / 300W |   1238MiB / 16160MiB |      5%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   37C    P0    38W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

This is torch, using the nightly wheel. In this simple case the learner's overhead is small, but I think with larger observation spaces, custom models, learner queues, etc., the memory usage blows up. Here is my config (the custom model is loaded later):

{
    "framework": "torch",
    "model": {
        "framestack": false
    },
    "num_workers": 3,
    "num_gpus_per_worker": 1,
    "num_gpus": 1,
    "rollout_fragment_length": 256,
    "train_batch_size": 1024,
    "lr": 0.0001
}
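For context, I'm not going through Tune here; I build the trainer directly and step it myself, roughly like this (a sketch; MyCustomModel and MyGpuEnv are placeholders for my actual model class and env):

import ray
from ray.rllib.agents.impala import ImpalaTrainer
from ray.rllib.models import ModelCatalog

from my_project import MyCustomModel, MyGpuEnv  # placeholders

ray.init()
ModelCatalog.register_custom_model("my_model", MyCustomModel)

config = {
    "framework": "torch",
    "model": {"framestack": False, "custom_model": "my_model"},
    "num_workers": 3, "num_gpus_per_worker": 1, "num_gpus": 1,
    "rollout_fragment_length": 256, "train_batch_size": 1024, "lr": 0.0001,
}

trainer = ImpalaTrainer(config=config, env=MyGpuEnv)
for _ in range(100):
    print(trainer.train()["episode_reward_mean"])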

I am using an env that requires a GPU, so each rollout worker really does need one.
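Inside a rollout worker, the env can find the GPU that Ray allocated to that worker. Illustrative sketch only, not my actual env code:

import ray
import torch

class GpuEnvMixin:
    """Illustrative: how an env created inside a rollout worker can pick up
    the GPU Ray allocated to that worker."""

    def _pick_device(self):
        gpu_ids = ray.get_gpu_ids()  # GPU IDs Ray assigned to this worker process
        # With num_gpus_per_worker=1, Ray sets CUDA_VISIBLE_DEVICES to that one
        # GPU, so "cuda:0" inside the worker maps to the allocated card.
        return torch.device("cuda:0" if gpu_ids else "cpu")

So if the learner lands on the same physical card as one of these workers, its memory comes straight out of the env's budget.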

This behavior seems to go away when using tune.run instead of trainer.train(). It may be worth suggesting tune to RLlib users in the documentation, as the progress reporting, autosaving, etc. seem quite a bit nicer than driving trainer.train() yourself. I’ve been missing out! A sketch of the tune.run call is below, followed by nvidia-smi output from a different config with 18 workers:
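Roughly what the tune.run call looks like (the env id, stopping condition, and per-worker GPU fraction are placeholders, not the exact values from my run):

import ray
from ray import tune

ray.init()
tune.run(
    "IMPALA",
    config={
        "env": "my_gpu_env",        # placeholder env id
        "framework": "torch",
        "num_workers": 18,
        # Fractional GPUs so 18 workers plus the learner fit on 4 cards
        # (the exact fraction is a placeholder).
        "num_gpus_per_worker": 0.15,
        "num_gpus": 1,
    },
    stop={"timesteps_total": 1_000_000},  # placeholder stopping condition
    checkpoint_freq=10,
)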

Fri Feb 26 18:22:42 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:18:00.0 Off |                  N/A |
| 27%   34C    P8    19W / 250W |   3788MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  On   | 00000000:3B:00.0 Off |                  N/A |
| 28%   36C    P8    20W / 250W |   8485MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  On   | 00000000:5E:00.0 Off |                  N/A |
| 27%   32C    P8     6W / 250W |   8485MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  On   | 00000000:86:00.0 Off |                  N/A |
| 27%   34C    P8     1W / 250W |   8485MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+