It seems the trainer process (e.g. IMPALA) does not really respect num_gpus when it comes to resource allocation. This is likely related to RLlib workers ignoring GPU restrictions. This happens with the config pasted at the end of this post.
As you can see in the nvidia-smi output below, the trainer ends up on the same GPU as another worker, even though each worker should take an entire GPU. So instead of running 6 environments, I’m limited to 3.
The learner seems to use a little more GPU memory than the workers, but I do see the two workers also consuming some of the GPU at times (see the second nvidia-smi output).
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1B.0 Off |                    0 |
| N/A   40C    P0    51W / 300W |   1248MiB / 16160MiB |      9%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:1C.0 Off |                    0 |
| N/A   39C    P0    54W / 300W |   1238MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:1D.0 Off |                    0 |
| N/A   42C    P0    57W / 300W |   1238MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   37C    P0    38W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
(base) ray@ip-172-31-83-50:~/AttentionNetVizdoomBEnchmarks/ray$ nvidia-smi
Fri Feb 26 04:05:58 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1B.0 Off |                    0 |
| N/A   40C    P0    50W / 300W |   1248MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:1C.0 Off |                    0 |
| N/A   39C    P0    54W / 300W |   1238MiB / 16160MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:1D.0 Off |                    0 |
| N/A   42C    P0    57W / 300W |   1238MiB / 16160MiB |      5%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   37C    P0    38W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
This is torch using the nightly wheel. In this simple case the overhead for the learner is small, but I think with larger observation spaces, custom models, learner queues, etc., the memory usage blows up. Here is my config (the custom model is loaded later):
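The resource-related part looks roughly like this (a minimal sketch; the env name, worker/env counts, and the omitted model settings are illustrative placeholders rather than my exact values):

import ray
from ray.rllib.agents.impala import ImpalaTrainer

ray.init()

config = {
    "env": "CartPole-v0",        # placeholder; my real env is a custom one
    "framework": "torch",
    "num_gpus": 1,               # GPU for the learner / trainer process
    "num_workers": 3,            # illustrative count
    "num_gpus_per_worker": 1,    # each rollout worker should claim a full GPU
    "num_envs_per_worker": 2,    # illustrative count
    # "model": {...},  the custom model gets plugged in here later
}

trainer = ImpalaTrainer(config=config)
for _ in range(100):
    result = trainer.train()
    print(result["episode_reward_mean"])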
This behavior seems to go away when using tune.run instead of trainer.train(). It may be worth suggesting tune.run over trainer.train() to RLlib users in the documentation, as the progress reporting, autosaving, etc. seem quite a bit nicer than trainer.train(). I’ve been missing out! Using tune.run on a different config with 18 workers (the call is roughly as sketched below):
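A minimal sketch of the switch, assuming RLlib's built-in "IMPALA" trainer string; the env, stop condition, and checkpoint settings are illustrative, not my exact setup:

import ray
from ray import tune

ray.init()

tune.run(
    "IMPALA",                        # RLlib's registered IMPALA trainer
    config={
        "env": "CartPole-v0",        # placeholder env
        "framework": "torch",
        "num_gpus": 1,
        "num_workers": 18,           # the run mentioned above used 18 workers
    },
    stop={"timesteps_total": 1_000_000},   # illustrative stop condition
    checkpoint_freq=10,              # tune handles checkpointing and progress reporting
)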