I have a research machine with one GPU and 60 CPUs, and I’m training an RL agent on a computationally-expensive environment using PPO.
Seeing as PPO runs rollouts and then weight updates sequentially rather than in parallel, I’d rather the GPU not spend half the time sitting idle. Accordingly, I’d like to use my GPU for both inference and training. How would I go about designing my config such that this takes place?
To use your single GPU for both inference (rollouts) and training in RLlib PPO, set num_env_runners=0 so that rollouts and training both run on the main process, which can then use the GPU for both phases. This is the recommended setup when you have only one GPU and want to maximize its utilization, especially for compute-intensive environments. Set num_gpus=1 in your config so the main process is allocated the GPU, and do not allocate GPU fractions to rollout workers via num_gpus_per_env_runner, since that effectively asks for additional GPU capacity. For efficiency, keep a large number of environments per worker if possible; with only one GPU, all inference and training will share it sequentially on the main process.
Example config:
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("YourEnv")          # your registered env ID
    .env_runners(num_env_runners=0)  # use the main process for rollouts
    .resources(num_gpus=1)           # give the main process the GPU
    # optionally tune other parameters as needed
)
This setup ensures both inference and training use the GPU, avoiding idle time between phases. For more details, see the RLlib Training Guide.
For synchronous algorithms like PPO and A2C, the driver and workers can make use of the same GPU. To do this with n GPUs:
gpu_count = n
num_gpus = 0.0001 # Driver GPU
num_gpus_per_env_runner = (gpu_count - num_gpus) / num_env_runners
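To make the arithmetic concrete, here is a minimal sketch of that split as a standalone function. The function name and the tiny driver fraction are illustrative, not RLlib API; the idea is only that the driver reserves a sliver of the GPU and the remainder is divided evenly across env runners.

```python
# Hypothetical helper: split n GPUs between the driver and the env runners
# for a synchronous algorithm like PPO. Names are illustrative.
def split_gpus(gpu_count: float, num_env_runners: int):
    driver_gpus = 0.0001  # tiny fraction so the driver process can see the GPU
    per_runner = (gpu_count - driver_gpus) / num_env_runners
    return driver_gpus, per_runner

# e.g. one GPU shared by the driver and 4 env runners
driver, per_runner = split_gpus(1, 4)
```

Each runner here ends up with just under a quarter of the GPU, which Ray treats as a scheduling hint rather than a hard memory limit.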
That said, I haven’t been able to get this to work in practice:
# The num_learners setting determines how many remote Learner workers there are in your Algorithm’s LearnerGroup. If you set this to 0, your LearnerGroup only contains a local Learner that runs on the main process (and shares the compute resources with that process, usually 1 CPU). Cannot set both `num_cpus_per_learner` > 1 and `num_gpus_per_learner` > 0!
# Either set `num_cpus_per_learner` > 1 (and `num_gpus_per_learner`=0) OR
# set `num_gpus_per_learner` > 0 (and leave `num_cpus_per_learner` at its default value of 1). This is due to issues with placement group fragmentation.
.resources(num_gpus=args.num_gpus)
.learners(num_gpus_per_learner=split_gpu_trainer)
.env_runners(
num_env_runners=args.num_env_runners,
num_gpus_per_env_runner=num_gpu_per_worker,
num_cpus_per_env_runner=num_cpu_per_worker,
)
Based on my experience, I recommend writing your own function that calculates the resources based on the following factors:
The number of trials you want to run in parallel / concurrently
The number of env runners or workers (old stack) per trial
The total number of CPUs and GPUs available
In the example above, the outputs of this function are the variables split_gpu_trainer, num_gpu_per_worker, and num_cpu_per_worker.
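A rough sketch of what such a function might look like, under assumptions of my own choosing (even CPU split per trial, one CPU reserved for the driver/learner, a tiny GPU fraction for the trainer); the policy is the point, not the exact numbers:

```python
# Hypothetical resource-budgeting helper; the split policy is an assumption,
# not anything prescribed by RLlib.
def plan_resources(total_cpus: int, total_gpus: float,
                   num_trials: int, num_env_runners: int):
    cpus_per_trial = total_cpus // num_trials
    gpus_per_trial = total_gpus / num_trials
    split_gpu_trainer = 0.0001  # leave almost all of the GPU to the workers
    num_gpu_per_worker = (gpus_per_trial - split_gpu_trainer) / num_env_runners
    # Reserve one CPU for the driver/learner, split the rest across runners.
    num_cpu_per_worker = max(1, (cpus_per_trial - 1) // num_env_runners)
    return split_gpu_trainer, num_gpu_per_worker, num_cpu_per_worker

# e.g. 16 CPUs, 1 GPU, a single trial with a single env runner
trainer_gpu, worker_gpu, worker_cpu = plan_resources(16, 1.0, 1, 1)
```

The returned values then feed directly into .resources(), .learners(), and .env_runners() as in the snippet above.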
@PhilippWillms Interesting information; it’s always neat to see how the systems side of Ray/RLlib works. I ran some more tests locally, monitoring resource usage, and I think I cracked the issue. My test notes are below (repo for replication if desired):
# My local machine has 16 CPUs and 1 GPU
python run_training.py --env-name SW_1v1_env_singleplayer --env-config '{"speed": 5.0, "ep_length": 4096, "egocentric": true, "size_multiplier": 1.0, "grav_multiplier": 1.0, "target_speed": 1.0, "target_ammo": 0.0, "elliptical": true, "probabilistic_difficulty": false, "inform_critic": false}' --verbose 1 --batch-size 65536 --minibatch-size 8192 --gamma .999 --attn-dim 128 --attn-ff-dim 512 --lr 1e-4 --lambda_ .8 --vf-clip inf --stop-iters=2 --num-env-runners 1 --gpus-per-learner 0.0001 --cpus-per-env-runner 15 --envs-per-env-runner 60 --remote-worker-envs
# Okay, give an env runner 15 CPUs and check usage:
# short burst of 33% GPU usage at 11% CPU usage, followed by a burst of 90% GPU usage with 10% CPU usage throughout. Memory at 77%
# First epoch done in 78s; Second in 157s
# Let's triple the number of envs per env runner, up to 180. No other changes:
# short burst of 24% GPU usage, 11% CPU usage, followed by burst of 90% GPU usage, still 10% CPU usage throughout.
# First epoch done in 80s, second in 164s
# Okay, let's divide the number of envs per runner by three, down to 20.
# short burst of 40% GPU usage, CPU still 10% throughout.
# First epoch done in 98s, second in 200s
# Try it with 1.0 GPUs per learner, just to see if the GPU helps
# 0% GPU usage, then 90% during training. CPU isn't constant, but goes to 40% during rollouts.
# First epoch done in 89s. Second in 182s. This is the opposite of what I'd expect - denying GPU usage to workers yields (slightly) better results!
Intuitively, I’m not sure what to make of this. Neither CPU nor GPU seems to be the bottleneck during rollouts. My best guess is that I’m dealing with a less-than-perfectly optimized environment, so I’m not able to push my CPUs to 100 percent. I was using a bunch of env runners in most of my earlier tests, so I suspect a lack of parallelism across workers was responsible for some of what I saw when adding GPUs to them. I’ve also now seen firsthand that GPU resources do get reallocated properly between learners and env runners when alternating between rollouts and training.
The good news is that weight updates take up about 70% of my time for each step, and I’ve now got a better sense of how to shave off some of the remainder.
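For a rough sense of the ceiling, Amdahl’s law gives the limit on what optimizing only the rollout phase can buy. Taking the ~70% update figure from my measurements above:

```python
# If weight updates take fraction f of each step and only the rollout phase
# is sped up, the best possible overall speedup (rollout time -> 0) is 1 / f.
f_update = 0.70
max_speedup = 1 / f_update
print(round(max_speedup, 2))  # ~1.43x even with instantaneous rollouts
```

So rollout-side tuning tops out around a 1.4x improvement here; anything beyond that has to come from the update phase itself.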