I have a machine with two GPUs, but when I run multiple RLlib jobs in parallel and assign 0.5 GPU to each job, they all get allocated to a single GPU, which sits at 100% utilization while the other GPU stays idle. Is there a way to assign jobs to specific GPUs or to configure RLlib to allocate resources more efficiently?
I think @kmeco is saying that he is running n RLlib jobs (tune trials) in parallel. So this should work regardless of torch vs tf.
What we refer to as multi-GPU in RLlib means that a single trial (one run) can utilize multiple GPUs by splitting the train batch into n sub-batches, feeding them through the network in parallel (one sub-batch per GPU), and then merging/averaging the gradients to update a central model. Yes, this is currently only supported by tf, but that is not what's happening here.
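For context, a rough sketch of what that multi-GPU setup looks like in a config (just an illustration; PPO and CartPole-v0 are placeholders, and the exact keys depend on your Ray/RLlib version):

```python
import ray
from ray import tune

ray.init()

tune.run(
    "PPO",
    stop={"training_iteration": 10},
    config={
        "env": "CartPole-v0",
        "framework": "tf",  # multi-GPU is tf-only at the moment
        # A single trial uses both GPUs: the train batch is split into 2
        # sub-batches, one per GPU, and the gradients are averaged before
        # the central model is updated.
        "num_gpus": 2,
    },
)
```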
We have never really tested or supported partial GPUs (e.g. 0.5) in RLlib. It's on my list to take a look at this quarter, but I haven't gotten to it yet. What could have happened in your case is that the fractional GPU request was rounded down to 0, so all your models ended up on the first GPU. If you instead set num_gpus=1 and only run 2 trials at a time, both GPUs should be utilized.
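Something along these lines should do it (a sketch only; PPO and CartPole-v0 are placeholders, and the exact API depends on your Ray version):

```python
import ray
from ray import tune

# Make both GPUs visible to Ray.
ray.init(num_gpus=2)

tune.run(
    "PPO",
    num_samples=4,  # 4 trials total ...
    # ... but Tune will only schedule 2 of them concurrently, since each
    # trial requests one whole GPU and only 2 GPUs are available.
    stop={"training_iteration": 10},
    config={
        "env": "CartPole-v0",
        "num_gpus": 1,  # one whole GPU per trial (no fractional GPUs)
    },
)
```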
Another question: Are you running your experiments through ray tune?