Hi, this is likely just a misunderstanding on my part, but I can’t figure out why my trial doesn’t run with certain settings.
In particular, for my use case I want to use (fractional) GPUs for my inference workers, since inference on CPU is not really tractable, as well as multiple GPUs for training. I am trying to achieve this with the ImpalaTrainer and PyTorch. If I only use a single GPU for training and set:
num_workers = 2
num_gpus_per_worker = 0.5
num_cpus_per_worker = 32
num_envs_per_worker = 32
num_gpus = 1
the training runs perfectly fine with 2 GPUs visible to my deployment. However, if I try to use multiple GPUs for training and set
num_workers = 2
num_gpus_per_worker = 0.5
num_cpus_per_worker = 32
num_envs_per_worker = 32
num_gpus = 2
I would assume that it now requires 3 GPUs. However, my trial gets stuck at PENDING, with Ray repeatedly outputting this message:
== Status ==
Memory usage on this node: 54.7/1007.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/256 CPUs, 0/3 GPUs, 0.0/77.84 GiB heap, 0.0/37.35 GiB objects
Result logdir: /root/ray-results/IMPALA_2021-10-07_14-29-36
Number of trials: 1/1 (1 PENDING)
+---------------------------------------------+----------+-------+
| Trial name | status | loc |
|---------------------------------------------+----------+-------|
| IMPALA_gym_test:testgym-v0_ff3bc_00000 | PENDING | |
+---------------------------------------------+----------+-------+
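For reference, my expectation for the total GPU requirement was just num_gpus plus the workers’ fractional shares. A quick sanity check along these lines (the numbers simply mirror the config above; ray.cluster_resources() only confirms what the cluster reports) gives 3 GPUs, which matches the 0/3 GPUs shown in the status:

import ray

ray.init(address="auto")  # connect to the running cluster
print(ray.cluster_resources())  # should list 3.0 GPUs, as in the status output above

# expected request: 1 trainer using num_gpus, plus num_workers workers using num_gpus_per_worker each
num_gpus = 2
num_workers = 2
num_gpus_per_worker = 0.5
print(num_gpus + num_workers * num_gpus_per_worker)  # 3.0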
I’m mostly following the example in the docs, running IMPALA via Tune like this:
ray.tune.run(
    IMPALA.ImpalaTrainer,
    config=config,
    stop=stop,
    local_dir=args.log_dir,
    reuse_actors=True
)
with a slightly modified IMPALA.DEFAULT_CONFIG containing the settings listed above.
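In case it helps, this is roughly how the config is built (a sketch; the environment gym_test:testgym-v0 is registered elsewhere in my script, and args comes from my argument parser):

import copy
from ray.rllib.agents import impala as IMPALA

config = copy.deepcopy(IMPALA.DEFAULT_CONFIG)
config.update({
    "env": "gym_test:testgym-v0",
    "framework": "torch",          # using PyTorch
    "num_workers": 2,
    "num_gpus_per_worker": 0.5,
    "num_cpus_per_worker": 32,
    "num_envs_per_worker": 32,
    "num_gpus": 2,                 # works with 1, stays PENDING with 2
})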
Have I misunderstood something about GPU allocation in RLlib, or are there other parameters that determine scheduling?