GPU memory allocation exceeding configuration

Hello, I am trying to run at least two different training experiments on my local machine. I tried a configuration that should let me use my resources wisely:

policy_conf['num_workers'] = 1
policy_conf['num_envs_per_worker'] = 1
policy_conf['num_gpus'] = 0.3 # total GPUs on machine = 1
policy_conf['num_gpus_per_worker'] = 0
policy_conf['num_cpus_for_driver'] = 0
policy_conf['num_cpus_per_worker'] = 4 # total CPUs on machine = 12
policy_conf['evaluation_num_workers'] = 1

From what I understand, this should reserve only 30% of my GPU memory for the driver (worker 0) to run training and inference, which works out to roughly 1.2 GB of the 4 GB total, and leave the rest of the GPU memory alone.
…but it seems to ignore those instructions, judging by the output of my nvidia-smi command:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1325      G   /usr/lib/xorg/Xorg                 28MiB |
|    0   N/A  N/A      1484      G   /usr/bin/gnome-shell               47MiB |
|    0   N/A  N/A      2236      G   /usr/lib/xorg/Xorg                216MiB |
|    0   N/A  N/A      2408      G   /usr/bin/gnome-shell               93MiB |
|    0   N/A  N/A      8503      C   ray::PPO.train_buffered()        3097MiB |
|    0   N/A  N/A     13519      G   ...AAAAAAAAA= --shared-files       82MiB |
|    0   N/A  N/A     27589      G   ...AAAAAAAAA= --shared-files       35MiB |
+-----------------------------------------------------------------------------+
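For comparison, here is a minimal sketch (not from the original post) of how to inspect Ray's own bookkeeping; ray.available_resources() only reports what the scheduler has reserved, not actual GPU memory use:

import ray

ray.init()

# Scheduler view of the machine: totals vs. what is still unreserved.
# With num_gpus=0.3 held by the Trainer, 'GPU' in available_resources()
# would read roughly 0.7, even though nvidia-smi shows ~3 GB in use.
print(ray.cluster_resources())
print(ray.available_resources())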

Is there a reason for this? How can I enforce memory restrictions?


Hey @hridayns, great question. The thing with fractional GPUs (e.g. num_gpus=0.3) is that this won’t limit the memory used by a single GPU user (e.g. a trial). What num_gpus=0.333 does is allow three different RLlib Trainers to run on the same GPU (without considering their memory usage!). Tune and RLlib can’t know in advance how much memory each Trainer will consume, as that depends heavily on the algorithm and model being used.
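To actually bound GPU memory, the cap has to be applied inside the deep-learning framework rather than through num_gpus. Below is a minimal sketch assuming a PyTorch-based Trainer (framework="torch"); the helper name limit_gpu_memory is just for illustration, and a TensorFlow setup would instead use something like tf.config.set_logical_device_configuration with a memory_limit:

import torch

def limit_gpu_memory(fraction: float = 0.3, device: int = 0) -> None:
    # Ask PyTorch's caching allocator to stay within `fraction` of the
    # device's total memory. This only caps allocations made through
    # PyTorch in this process; other processes on the GPU are unaffected.
    if torch.cuda.is_available():
        torch.cuda.set_per_process_memory_fraction(fraction, device=device)

# Call this in every process that touches the GPU (e.g. at the start of
# each Trainer/worker process) before any models are created.
limit_gpu_memory(0.3, 0)

Note that this is a hard cap on PyTorch's allocator: allocations beyond it raise an out-of-memory error rather than silently spilling over.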


Thank you for the answer, @sven1977! Does this mean that if the same configuration is used on a machine with more resources (for example, 4 GPUs and 96 CPU cores), the resources used will also increase?