GPU memory allocation exceeding configuration

Hello, I am trying to run at least two different training experiments on my local machine, so I tried a configuration that should let me use my resources wisely:

policy_conf['num_workers'] = 1
policy_conf['num_envs_per_worker'] = 1
policy_conf['num_gpus'] = 0.3 # total GPUs on machine = 1
policy_conf['num_gpus_per_worker'] = 0
policy_conf['num_cpus_for_driver'] = 0
policy_conf['num_cpus_per_worker'] = 4 # total CPUs on machine = 12
policy_conf['evaluation_num_workers'] = 1

From what I understand, this should let the driver (local worker 0) use only 30% of my GPU memory for inference/training, which works out to about 1.2 GB, and leave the rest of the 4 GB of GPU memory alone.
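The arithmetic I had in mind, as a quick sketch (4 GiB is my card's total memory):

```python
# Expected GPU memory use if num_gpus=0.3 acted as a hard memory cap
# (numbers are for my 4 GiB card).
total_gpu_mem_mib = 4096
num_gpus = 0.3
expected_mib = total_gpu_mem_mib * num_gpus
print(expected_mib)  # -> 1228.8, i.e. ~1.2 GiB
```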
…but it seems to ignore those settings, as the output of my nvidia-smi command shows.

| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|    0   N/A  N/A      1325      G   /usr/lib/xorg/Xorg                 28MiB |
|    0   N/A  N/A      1484      G   /usr/bin/gnome-shell               47MiB |
|    0   N/A  N/A      2236      G   /usr/lib/xorg/Xorg                216MiB |
|    0   N/A  N/A      2408      G   /usr/bin/gnome-shell               93MiB |
|    0   N/A  N/A      8503      C   ray::PPO.train_buffered()        3097MiB |
|    0   N/A  N/A     13519      G   ...AAAAAAAAA= --shared-files       82MiB |
|    0   N/A  N/A     27589      G   ...AAAAAAAAA= --shared-files       35MiB |

Is there a reason for this? How can I enforce memory restrictions?

Hey @hridayns, great question. The thing with fractional GPUs (e.g. num_gpus=0.3) is that they don't limit the memory used by a single GPU user (e.g. a trial). What num_gpus=0.333 does is allow three different RLlib Trainers to run on the same GPU (without considering their memory usage!). Tune and RLlib can't know in advance how much memory each Trainer will consume, as this depends heavily on the algorithm and model being used.
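To illustrate: num_gpus is pure scheduling bookkeeping, and any hard memory cap has to come from the deep-learning framework itself. A minimal sketch, assuming TensorFlow as the framework and RLlib's tf_session_args config option (the gpu_options field names follow TF1-style GPUOptions; please verify them against your TF/RLlib versions):

```python
# 1) What num_gpus=0.333 buys you: Ray will co-schedule up to this many
#    trainers on one physical GPU. This is bookkeeping only; it does not
#    cap what each process actually allocates.
max_trainers_per_gpu = int(1 / 0.333)
print(max_trainers_per_gpu)  # -> 3

# 2) To actually cap memory, configure the framework, e.g. via RLlib's
#    TF session args (TF1-style GPUOptions; field names assumed):
policy_conf = {}
policy_conf["tf_session_args"] = {
    "gpu_options": {
        # Hard cap: this process may use at most ~30% of GPU memory.
        "per_process_gpu_memory_fraction": 0.3,
        # Allocate lazily up to the cap instead of grabbing it all up front.
        "allow_growth": True,
    },
}
```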

Thank you for the answer, @sven1977! Does this mean that if the same configuration is used on a machine with more resources (for example, 4 GPUs and 96 CPU cores), the resources used will also increase?