Configuring resources for parallel GPU Tune runs on a single server

Severity: Medium to Low. It contributes to some difficulty, but I can work around it.

Hi Folks,

First post to the forum, and I'm relatively new to RLlib.

I am trying to tune a simple vanilla PPO with the following tuner specs:
tuner = tune.Tuner(
    "PPO",
    param_space=config.to_dict(),
    tune_config=tune.TuneConfig(num_samples=8, max_concurrent_trials=4),
    run_config=air.RunConfig(stop=stop),
)

The config uses the defaults for the algorithm (PPO), the model, and the iteration stop parameters. The only changes to the config are the resource settings shown below.
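
For context, the surrounding setup looks roughly like this (the SimpleCorridor import path, env_config, and stop values here are illustrative, not my exact ones):

from ray import air, tune
from ray.rllib.algorithms.ppo import PPOConfig
# Example env that ships with RLlib (import path assumed for Ray 2.4).
from ray.rllib.examples.env.simple_corridor import SimpleCorridor

config = (
    PPOConfig()
    .environment(SimpleCorridor, env_config={"corridor_length": 10})
)
stop = {"training_iteration": 10}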

I have a system with 2 GPUs and 48 CPU cores.
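
Ray sees both GPUs (the status output below shows 2.0/2); a quick sanity check would be something like:

import ray

ray.init()
# On this box this should report roughly {"CPU": 48.0, "GPU": 2.0, ...}.
print(ray.cluster_resources())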

I have been working with all sorts of combinations of num_learner_workers, num_gpus_per_learner_worker, and num_gpus, and have zeroed in on a behavior I can't figure out.

When I set the configuration
config["num_gpus"] = 2
config["num_learner_workers"] = 2
config["num_gpus_per_learner_worker"] = 1

Tune proceeds in a serial fashion:

Current time: 2023-10-05 19:48:15 (running for 00:00:10.55)
Using FIFO scheduling algorithm.
Logical resource usage: 3.0/48 CPUs, 2.0/2 GPUs (0.0/1.0 accelerator_type:RTX)
Result logdir: /root/ray_results/PPO
Number of trials: 4/8 (3 PENDING, 1 RUNNING)
+--------------------------------+----------+----------------------+
| Trial name                     | status   | loc                  |
|--------------------------------+----------+----------------------|
| PPO_SimpleCorridor_19707_00000 | RUNNING  | 192.168.229.86:36108 |
| PPO_SimpleCorridor_19707_00001 | PENDING  |                      |
| PPO_SimpleCorridor_19707_00002 | PENDING  |                      |
| PPO_SimpleCorridor_19707_00003 | PENDING  |                      |
+--------------------------------+----------+----------------------+
(This repeats, with only one trial running at a time.)

But when I set num_gpus to 1 (even though I physically have two GPUs):

config["num_gpus"] = 1
config["num_learner_workers"] = 2
config["num_gpus_per_learner_worker"] = 1

Tune finds the other GPU and starts running the trials in parallel:

== Status ==
Current time: 2023-10-05 19:53:09 (running for 00:00:09.86)
Using FIFO scheduling algorithm.
Logical resource usage: 3.0/48 CPUs, 1.0/2 GPUs (0.0/1.0 accelerator_type:RTX)
Result logdir: /root/ray_results/PPO
Number of trials: 4/8 (3 PENDING, 1 RUNNING)
+--------------------------------+----------+----------------------+
| Trial name                     | status   | loc                  |
|--------------------------------+----------+----------------------|
| PPO_SimpleCorridor_c93ec_00000 | RUNNING  | 192.168.229.86:27964 |
| PPO_SimpleCorridor_c93ec_00001 | PENDING  |                      |
| PPO_SimpleCorridor_c93ec_00002 | PENDING  |                      |
| PPO_SimpleCorridor_c93ec_00003 | PENDING  |                      |
+--------------------------------+----------+----------------------+

== Status ==
Current time: 2023-10-05 19:53:19 (running for 00:00:20.19)
Using FIFO scheduling algorithm.
Logical resource usage: 6.0/48 CPUs, 2.0/2 GPUs (0.0/1.0 accelerator_type:RTX)
Result logdir: /root/ray_results/PPO
Number of trials: 4/8 (2 PENDING, 2 RUNNING)
+--------------------------------+----------+----------------------+
| Trial name                     | status   | loc                  |
|--------------------------------+----------+----------------------|
| PPO_SimpleCorridor_c93ec_00000 | RUNNING  | 192.168.229.86:27964 |
| PPO_SimpleCorridor_c93ec_00001 | RUNNING  | 192.168.229.86:29636 |
| PPO_SimpleCorridor_c93ec_00002 | PENDING  |                      |
| PPO_SimpleCorridor_c93ec_00003 | PENDING  |                      |
+--------------------------------+----------+----------------------+
(The trials proceed two at a time.)

I'm quite stumped. I'm on Ray version 2.4, as I have an older server and OS configuration whose drivers hold me back to this version of Ray. I'm not sure whether this is something fixed in later versions.

Although this latter configuration works, it is not intuitive, and I imagine I'm not using the configuration parameters correctly.

Any help most welcome.

Cheers!

-felgryn

@Felgryn welcome to the forum and thanks for posting this.

This looks like intended behavior: if we set

config["num_gpus"] = 2

then a single trial has 2 GPUs available to it, which means no other trial has enough hardware resources to run.

Now, setting

config["num_gpus"] = 1

means you give each trial only a single GPU, shared between the two learner workers. Now there are enough resources to run two trials in parallel.
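
In AlgorithmConfig terms, a minimal sketch of the working setting (same values as above, written via the resources() method; exact per-trial resource accounting may differ across Ray versions):

from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("SimpleCorridor")      # assumes the env is registered elsewhere
    .resources(
        num_gpus=1,                     # GPUs reserved by each trial
        num_learner_workers=2,          # learner workers sharing that budget
        num_gpus_per_learner_worker=1,
    )
)
# Each trial now reserves 1 of the 2 physical GPUs, so Tune can schedule a
# second trial next to it (2.0/2 GPUs in the status above). With num_gpus=2,
# a single trial claims both GPUs and the remaining trials stay PENDING.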

Hi @Lars_Simon_Zehnder,

Ah! Many thanks for the clarification; I now understand the use of this config element better.

BTW, I'm really appreciating the configurability of RLlib. Being able to specify the runtime, environment, algorithm (general and specific), and model all through configuration files is very powerful!

Really nice design.

Cheers!

-Felgryn
