Separating GPUs for learners and workers - Apex DQN

Creating N workers on P GPUs means each GPU worker gets a P/N fraction of a GPU if we provide num_gpus_per_worker. The learner is placed on GPU 0 by default. Since the learner occupies GPU 0 and workers are scheduled there as well, the number of workers per GPU is limited to however many fit in the memory remaining on GPU 0 (after the learner's share), and that same limit applies to every other GPU.
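
For reference, here is a minimal sketch of the kind of configuration being described, using the old-style RLlib config dict. The environment name and the specific numbers (4 GPUs, 12 workers, 0.25 GPU per worker) are placeholders for illustration, not values from the original post:

```python
import ray
from ray import tune

ray.init()

# Illustrative numbers only: a cluster with P = 4 GPUs, a learner that
# takes 1 full GPU, and N = 12 rollout workers that each request a
# fraction of a GPU. Adjust to whatever fits your cluster.
config = {
    "env": "CartPole-v1",         # placeholder environment
    "num_gpus": 1,                # learner GPU (placed on GPU 0 by default)
    "num_workers": 12,            # N rollout workers
    "num_gpus_per_worker": 0.25,  # fractional GPU request per worker
}

# "APEX" is the registered name of the Ape-X DQN trainer in RLlib.
tune.run("APEX", config=config, stop={"training_iteration": 1})
```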

Is there a way to disentangle the gpu usage for learners and workers?

Hi @kawshik8,

You can disentangle it via create_env_on_driver=False. You have to have rollout workers in this case, though.
Have you tried setting num_gpus_per_worker=P/N and create_env_on_driver=False to see what happens?
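
In config form, the suggestion above would look roughly like this (the 0.25 is a stand-in for whatever P/N works out to on your cluster):

```python
# Settings suggested above, as they would appear in an RLlib config dict.
config = {
    "create_env_on_driver": False,  # default is already False
    "num_gpus_per_worker": 0.25,    # stand-in for P/N; pick the fraction that fits your cluster
    # ... rest of the Apex DQN config ...
}
```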

Resource allocation is automatic, and I experimented with CUDA_VISIBLE_DEVICES a while ago, but there is no easy way to tell RLlib “put the learner load on GPU 0 and the rest on GPU 1” or something similar, if that is what you are looking for.
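
For completeness, the CUDA_VISIBLE_DEVICES approach only restricts which physical devices the Ray process can see at all; it does not split the learner onto one device and the workers onto another. A rough sketch (the device indices are just an example):

```python
import os

# Hide GPU 0 from this Ray process entirely, e.g. to keep it free for
# another job. This does NOT route learner load to one device and worker
# load to another; it only limits which devices Ray can detect locally.
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3"

import ray

ray.init()  # Ray will now only auto-detect GPUs 1-3
```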

Hey @arturn

Thanks for the answer. The default value for “create_env_on_driver” is already False, and I believe rollout workers are created automatically for Apex DQN.

Rather than separating the GPU usage of the learner and the workers, my goal is to make sure every GPU is used to its full capacity. Currently, if the learner uses part of a GPU, the memory left over on that GPU is also the amount of memory that can be used on every other GPU in the cluster. Is there a way to remove this equal-allocation strategy?

I’m sorry, maxing out resource utilization is beyond my knowledge of this library :slight_smile: Maybe @avnishn can chime in?

TL;DR: No.

Long answer:
In Ray, actors hold on to the resources they are created with for their entire lifetime. This means that if you create a learner that has a GPU and sampling workers that have GPUs, they will hold those resources for the duration of the RLlib experiment.

The reason, for the time being, is that if actors aren't guaranteed the resources necessary to run, they will sit in a pending state, and it's possible that the resources never become available or that there is a race condition on the resources. I think that in future versions of RLlib we could implement some smart logic to allow the transfer of GPUs from samplers to the learner, but not in the near future.
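
As a concrete illustration of the lifetime point above, a Ray actor declares its resource requirement at creation time and keeps it until it is terminated. The class name and the 0.25 fraction below are just examples:

```python
import ray

ray.init(num_gpus=1)  # declare one logical GPU for this example

# This actor reserves 0.25 of a GPU when it is created and holds that
# reservation for its whole lifetime, whether or not it is doing work.
@ray.remote(num_gpus=0.25)
class Sampler:
    def sample(self):
        return "some rollout data"

# Four such actors together pin the full GPU for as long as they exist.
workers = [Sampler.remote() for _ in range(4)]
print(ray.get([w.sample.remote() for w in workers]))

# The GPU fraction is only released when the actors are killed or go out of scope.
for w in workers:
    ray.kill(w)
```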
