Hello, I am trying to run at least two different training experiments at the same time on my local machine. I tried a configuration that should let me split my resources sensibly between them:
policy_conf['num_workers'] = 1
policy_conf['num_envs_per_worker'] = 1
policy_conf['num_gpus'] = 0.3 # total GPUs on machine = 1
policy_conf['num_gpus_per_worker'] = 0
policy_conf['num_cpus_for_driver'] = 0
policy_conf['num_cpus_per_worker'] = 4 # total CPUs on machine = 12
policy_conf['evaluation_num_workers'] = 1
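For context, here is a minimal sketch of how a config like this gets handed to the trainer. The environment name is just a placeholder for my actual env, and I am on the older Ray 1.x-style PPOTrainer / DEFAULT_CONFIG API:

import ray
from ray.rllib.agents.ppo import PPOTrainer, DEFAULT_CONFIG

# Register the whole machine with Ray once; each trainer then books a
# slice of these resources according to its own config.
ray.init(num_cpus=12, num_gpus=1)

policy_conf = DEFAULT_CONFIG.copy()
policy_conf['env'] = 'CartPole-v1'  # placeholder for my actual environment
policy_conf['num_gpus'] = 0.3       # fractional GPU request for this trainer
# ...plus the remaining settings listed above...

trainer = PPOTrainer(config=policy_conf)
result = trainer.train()
print(result['episode_reward_mean'])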
From what I understand, this should reserve only 30% of my GPU for the driver (worker 0), which runs the training updates; on my 4 GB card that works out to roughly 1.2 GB, leaving the rest of the GPU memory alone.
…but the limit seems to be ignored, judging by the output of nvidia-smi:
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1325 G /usr/lib/xorg/Xorg 28MiB |
| 0 N/A N/A 1484 G /usr/bin/gnome-shell 47MiB |
| 0 N/A N/A 2236 G /usr/lib/xorg/Xorg 216MiB |
| 0 N/A N/A 2408 G /usr/bin/gnome-shell 93MiB |
| 0 N/A N/A 8503 C ray::PPO.train_buffered() 3097MiB |
| 0 N/A N/A 13519 G ...AAAAAAAAA= --shared-files 82MiB |
| 0 N/A N/A 27589 G ...AAAAAAAAA= --shared-files 35MiB |
+-----------------------------------------------------------------------------+
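For what it's worth, I can also query Ray from the driver process to see what it has booked on the scheduling side (output omitted here; the numbers passed to ray.init are my machine's totals):

import ray

if not ray.is_initialized():
    ray.init(num_cpus=12, num_gpus=1)

# Total resources Ray tracks on this machine vs. what is still unbooked
# after the trainer has placed its fractional GPU request.
print(ray.cluster_resources())    # e.g. {'CPU': 12.0, 'GPU': 1.0, ...}
print(ray.available_resources())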
Is there a reason for this behaviour, and how can I enforce the memory restriction so that a second experiment also fits on the GPU?