I have a machine with two GPUs, but when I run multiple RLlib jobs in parallel and assign 0.5 GPU to each job, they all get allocated to a single GPU, which sits at 100% utilization while the other GPU stays idle. Is there a way to assign jobs to specific GPUs or to configure RLlib to allocate resources more efficiently?
I think @kmeco is saying that he is running n RLlib jobs (tune trials) in parallel. So this should work regardless of torch vs tf.
What we refer to as multi-GPU in RLlib means that a single trial (one run) can utilize multiple GPUs by splitting the train batch into n sub-batches, feeding them through the network in parallel (one sub-batch per GPU), and then merging/averaging the gradients to update a central model. Yes, this is currently only supported by tf, but that is not what's happening here.
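For context, a rough sketch of what that multi-GPU setup looks like in a config (just an illustration; PPO and CartPole-v0 are placeholders, and the exact keys depend on your Ray/RLlib version):

```python
import ray
from ray import tune

ray.init()

tune.run(
    "PPO",
    stop={"training_iteration": 10},
    config={
        "env": "CartPole-v0",
        "framework": "tf",  # multi-GPU is tf-only at the moment
        # A single trial uses both GPUs: the train batch is split into 2
        # sub-batches, one per GPU, and the gradients are averaged before
        # the central model is updated.
        "num_gpus": 2,
    },
)
```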
We have never really tested or supported partial GPUs (e.g. 0.5) in RLlib. It's on my list to take a look at this quarter, but I haven't gotten to it yet. What could have happened in your case is that the fractional GPU request was rounded down to 0, so all your models ended up on the first GPU. If you instead set num_gpus=1 and only run 2 trials at a time, both GPUs should be utilized.
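Something along these lines should do it (a sketch only; PPO and CartPole-v0 are placeholders, and the exact API depends on your Ray version):

```python
import ray
from ray import tune

# Make both GPUs visible to Ray.
ray.init(num_gpus=2)

tune.run(
    "PPO",
    num_samples=4,  # 4 trials total ...
    # ... but Tune will only schedule 2 of them concurrently, since each
    # trial requests one whole GPU and only 2 GPUs are available.
    stop={"training_iteration": 10},
    config={
        "env": "CartPole-v0",
        "num_gpus": 1,  # one whole GPU per trial (no fractional GPUs)
    },
)
```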
Another question: Are you running your experiments through ray tune?