Configuring resources for parallel GPU Tune runs on a single server

Severity: Medium to Low. It contributes to some difficulty, but I can work around it.

Hi Folks,

First post to the forum, and I'm relatively new to RLlib.

I am trying to tune a simple vanilla PPO with the following tuner specs:
tuner = tune.Tuner(
    "PPO",
    param_space=config.to_dict(),
    tune_config=tune.TuneConfig(num_samples=8, max_concurrent_trials=4),
    run_config=air.RunConfig(stop=stop),
)

The config uses the defaults for the algorithm (PPO), the model, and the iteration stop parameters. The only changes to the config are the resource settings shown below.
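
For context, the surrounding setup looks roughly like this (the SimpleCorridor import path, env_config, and stop values here are illustrative, not my exact ones):

from ray import air, tune
from ray.rllib.algorithms.ppo import PPOConfig
# Example env that ships with RLlib (import path assumed for Ray 2.4).
from ray.rllib.examples.env.simple_corridor import SimpleCorridor

config = (
    PPOConfig()
    .environment(SimpleCorridor, env_config={"corridor_length": 10})
)
stop = {"training_iteration": 10}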

I have a system with 2 GPUs and 48 CPU cores.
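
Ray sees both GPUs (the status output below shows 2.0/2); a quick sanity check would be something like:

import ray

ray.init()
# On this box this should report roughly {"CPU": 48.0, "GPU": 2.0, ...}.
print(ray.cluster_resources())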

I have been working with all sorts of combinations of num_learner_workers, num_gpus_per_learner_worker, and num_gpus, and have zeroed in on a behavior I can't figure out.

When I set the configuration
config["num_gpus"] = 2
config["num_learner_workers"] = 2
config["num_gpus_per_learner_worker"] = 1

Tune proceeds in a serial fashion:

Current time: 2023-10-05 19:48:15 (running for 00:00:10.55)
Using FIFO scheduling algorithm.
Logical resource usage: 3.0/48 CPUs, 2.0/2 GPUs (0.0/1.0 accelerator_type:RTX)
Result logdir: /root/ray_results/PPO
Number of trials: 4/8 (3 PENDING, 1 RUNNING)
+--------------------------------+----------+----------------------+
| Trial name                     | status   | loc                  |
|--------------------------------+----------+----------------------|
| PPO_SimpleCorridor_19707_00000 | RUNNING  | 192.168.229.86:36108 |
| PPO_SimpleCorridor_19707_00001 | PENDING  |                      |
| PPO_SimpleCorridor_19707_00002 | PENDING  |                      |
| PPO_SimpleCorridor_19707_00003 | PENDING  |                      |
+--------------------------------+----------+----------------------+
(This repeats, with only one trial running at a time.)

But when I set num_gpus to 1 (even though I physically have two GPUs):

config["num_gpus"] = 1
config["num_learner_workers"] = 2
config["num_gpus_per_learner_worker"] = 1

Tune finds the other GPU and starts running the trials in parallel:

== Status ==
Current time: 2023-10-05 19:53:09 (running for 00:00:09.86)
Using FIFO scheduling algorithm.
Logical resource usage: 3.0/48 CPUs, 1.0/2 GPUs (0.0/1.0 accelerator_type:RTX)
Result logdir: /root/ray_results/PPO
Number of trials: 4/8 (3 PENDING, 1 RUNNING)
+--------------------------------+----------+----------------------+
| Trial name                     | status   | loc                  |
|--------------------------------+----------+----------------------|
| PPO_SimpleCorridor_c93ec_00000 | RUNNING  | 192.168.229.86:27964 |
| PPO_SimpleCorridor_c93ec_00001 | PENDING  |                      |
| PPO_SimpleCorridor_c93ec_00002 | PENDING  |                      |
| PPO_SimpleCorridor_c93ec_00003 | PENDING  |                      |
+--------------------------------+----------+----------------------+

== Status ==
Current time: 2023-10-05 19:53:19 (running for 00:00:20.19)
Using FIFO scheduling algorithm.
Logical resource usage: 6.0/48 CPUs, 2.0/2 GPUs (0.0/1.0 accelerator_type:RTX)
Result logdir: /root/ray_results/PPO
Number of trials: 4/8 (2 PENDING, 2 RUNNING)
+--------------------------------+----------+----------------------+
| Trial name                     | status   | loc                  |
|--------------------------------+----------+----------------------|
| PPO_SimpleCorridor_c93ec_00000 | RUNNING  | 192.168.229.86:27964 |
| PPO_SimpleCorridor_c93ec_00001 | RUNNING  | 192.168.229.86:29636 |
| PPO_SimpleCorridor_c93ec_00002 | PENDING  |                      |
| PPO_SimpleCorridor_c93ec_00003 | PENDING  |                      |
+--------------------------------+----------+----------------------+
(The trials proceed two at a time.)

I'm quite stumped. I'm on Ray version 2.4, as I have an older server and OS configuration whose drivers hold me back to this version of Ray. I'm not sure whether this is something fixed in later versions.

Although this latter configuration works, it is not intuitive, and I imagine I'm not using the configuration parameters correctly.

Any help most welcome.

Cheers!

-felgryn

@Felgryn welcome to the forum and thanks for posting this.

This looks like intended behavior: if we set

config["num_gpus"] = 2

then a single trial has 2 GPUs available to it, which means no other trial has enough hardware resources to run.

Now, setting

config["num_gpus"] = 1

means you give each trial only a single GPU, shared between the two learner workers. Now there are enough resources to run two trials in parallel.
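
In AlgorithmConfig terms, a minimal sketch of the working setting (same values as above, written via the resources() method; exact per-trial resource accounting may differ across Ray versions):

from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("SimpleCorridor")      # assumes the env is registered elsewhere
    .resources(
        num_gpus=1,                     # GPUs reserved by each trial
        num_learner_workers=2,          # learner workers sharing that budget
        num_gpus_per_learner_worker=1,
    )
)
# Each trial now reserves 1 of the 2 physical GPUs, so Tune can schedule a
# second trial next to it (2.0/2 GPUs in the status above). With num_gpus=2,
# a single trial claims both GPUs and the remaining trials stay PENDING.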

Hi @Lars_Simon_Zehnder,

Ah! Many thanks for the clarification; I now understand the use of this config element better.

BTW, I'm really appreciating the configurability of RLlib. Being able to specify the runtime, environment, algorithm (general and specific), and model all through configuration files is very powerful!

Really nice design.

Cheers!

-Felgryn
