Hi, this is likely just a misunderstanding on my part, but I can’t figure out why my trial doesn’t run with certain settings.
In particular, for my use case I want to use (fractional) GPUs for my inference workers, since inference on CPU is not really tractable, as well as multiple GPUs for training. I am trying to achieve this with the ImpalaTrainer and PyTorch. If I only use a single GPU for training and set:
num_workers = 2
num_gpus_per_worker = 0.5
num_cpus_per_worker = 32
num_envs_per_worker = 32
num_gpus = 1
the training runs perfectly fine with 2 GPUs visible to my deployment. However, if I try to use multiple GPUs for training and set
num_workers = 2
num_gpus_per_worker = 0.5
num_cpus_per_worker = 32
num_envs_per_worker = 32
num_gpus = 2
I would assume that it now requires 3 GPUs. However, my trial gets stuck at PENDING, with Ray repeatedly outputting this message:
== Status ==
Memory usage on this node: 54.7/1007.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/256 CPUs, 0/3 GPUs, 0.0/77.84 GiB heap, 0.0/37.35 GiB objects
Result logdir: /root/ray-results/IMPALA_2021-10-07_14-29-36
Number of trials: 1/1 (1 PENDING)
+---------------------------------------------+----------+-------+
| Trial name | status | loc |
|---------------------------------------------+----------+-------|
| IMPALA_gym_test:testgym-v0_ff3bc_00000 | PENDING | |
+---------------------------------------------+----------+-------+
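For reference, my expectation for the total GPU requirement was just num_gpus plus the workers’ fractional shares. A quick sanity check along these lines (the numbers simply mirror the config above; ray.cluster_resources() only confirms what the cluster reports) gives 3 GPUs, which matches the 0/3 GPUs shown in the status:

import ray

ray.init(address="auto")  # connect to the running cluster
print(ray.cluster_resources())  # should list 3.0 GPUs, as in the status output above

# expected request: 1 trainer using num_gpus, plus num_workers workers using num_gpus_per_worker each
num_gpus = 2
num_workers = 2
num_gpus_per_worker = 0.5
print(num_gpus + num_workers * num_gpus_per_worker)  # 3.0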
I’m mostly following the example in the docs, running IMPALA via Tune like this:
ray.tune.run(
    IMPALA.ImpalaTrainer,
    config=config,
    stop=stop,
    local_dir=args.log_dir,
    reuse_actors=True
)
with a slightly modified IMPALA.DEFAULT_CONFIG containing the settings listed above.
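In case it helps, this is roughly how the config is built (a sketch; the environment gym_test:testgym-v0 is registered elsewhere in my script, and args comes from my argument parser):

import copy
from ray.rllib.agents import impala as IMPALA

config = copy.deepcopy(IMPALA.DEFAULT_CONFIG)
config.update({
    "env": "gym_test:testgym-v0",
    "framework": "torch",          # using PyTorch
    "num_workers": 2,
    "num_gpus_per_worker": 0.5,
    "num_cpus_per_worker": 32,
    "num_envs_per_worker": 32,
    "num_gpus": 2,                 # works with 1, stays PENDING with 2
})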
Have I misunderstood something about GPU allocation in RLlib, or are there other parameters that determine scheduling?