RL trial stuck at PENDING when trying to use multiple GPUs

Hi, this is likely just a misunderstanding on my part, but I can’t figure out why my trial doesn’t run with certain settings.

In particular, for my use case I want to use (fractional) GPUs for my inference workers, since inference on CPU is not really tractable, as well as multiple GPUs for training. I am trying to achieve this with the ImpalaTrainer and PyTorch. If I use only a single GPU for training and set:

num_workers = 2
num_gpus_per_worker = 0.5
num_cpus_per_worker = 32
num_envs_per_worker = 32
num_gpus = 1

the training runs perfectly fine, with 2 GPUs visible to my deployment. However, if I try to use multiple GPUs for training and set

num_workers = 2
num_gpus_per_worker = 0.5
num_cpus_per_worker = 32
num_envs_per_worker = 32
num_gpus = 2

I would assume that it now requires 3 GPUs. However, my trial gets stuck in PENDING, with Ray repeatedly outputting the message:

== Status ==
Memory usage on this node: 54.7/1007.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/256 CPUs, 0/3 GPUs, 0.0/77.84 GiB heap, 0.0/37.35 GiB objects
Result logdir: /root/ray-results/IMPALA_2021-10-07_14-29-36
Number of trials: 1/1 (1 PENDING)
+---------------------------------------------+----------+-------+
| Trial name                                  | status   | loc   |
|---------------------------------------------+----------+-------|
| IMPALA_gym_test:testgym-v0_ff3bc_00000      | PENDING  |       |
+---------------------------------------------+----------+-------+             
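For reference, what the cluster actually registers versus what is currently free can also be queried from Ray directly; a minimal sketch using the standard ray.cluster_resources() and ray.available_resources() calls (assuming a connection to the already-running cluster):

import ray

# Connect to the already-running cluster instead of starting a local one.
ray.init(address="auto")

# Total resources registered with the cluster; here this should match the
# status output above (256 CPUs, 3 GPUs).
print(ray.cluster_resources())

# Resources currently free for scheduling new actors/tasks.
print(ray.available_resources())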

I’m mostly following the example in the docs, running IMPALA via Tune like this:

ray.tune.run(
    IMPALA.ImpalaTrainer,
    config=config,
    stop=stop,
    local_dir=args.log_dir,
    reuse_actors=True,
)

with a slightly modified IMPALA.DEFAULT_CONFIG that contains the settings above (roughly sketched at the end of this post).
Have I misunderstood something about GPU allocation in RLlib, or are there other parameters that determine scheduling?
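For completeness, this is roughly how the config is built (a sketch, not my exact script; the env name is taken from the trial name above, and I assume the standard RLlib config keys):

import copy

from ray.rllib.agents import impala as IMPALA

config = copy.deepcopy(IMPALA.DEFAULT_CONFIG)
config.update({
    "env": "gym_test:testgym-v0",   # custom env, name as in the trial above
    "framework": "torch",           # PyTorch, as mentioned above
    # Inference/rollout workers: 2 workers * 0.5 GPU each = 1 GPU total.
    "num_workers": 2,
    "num_gpus_per_worker": 0.5,
    "num_cpus_per_worker": 32,
    "num_envs_per_worker": 32,
    # Learner (trainer) process: 2 GPUs.
    "num_gpus": 2,
})
# Expected total demand: 2 (learner) + 2 * 0.5 (workers) = 3 GPUs.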

Check if this GitHub issue thread helps you: [rllib][tune] Training stuck in "Pending" status · Issue #16425 · ray-project/ray · GitHub

This issue did indeed help me fix my problem, thank you!