Total Workers == (Number of GPUs) - 1?

Gregory · February 8, 2023, 1:35am

I have a system with 4 GPUs, each of which can hold a single copy of the model being trained. When scaling up my RLlib experiment to use all 4 GPUs, I am only able to use 3 workers, yet all 4 GPUs are being utilized (as shown by the GPU RAM filling up). Requesting 4 workers leaves the trial in ‘pending’, and the experiment does not start due to a lack of resources.

Does the RLlib trainer itself count as a worker, so that even though I’m setting workers to 3, there are in reality 4 running? I’ll gladly follow up with specifics if not.

sven1977 · February 9, 2023, 10:51am

Hey @Gregory, great question!

For “normal” algorithms:

Rollout workers: only used for environment sampling (and policy inference to compute actions), with 1 CPU per worker by default. Set the number of workers via config.rollouts(num_rollout_workers=n).
GPUs: only used on the central learner side, NOT for environment sampling or policy inference. Set this via config.resources(num_gpus=n).

Only our “DDPPO” algorithm can actually utilize GPUs on the rollout workers, as it learns in a decentralized fashion. For this algorithm, you should set config.resources(num_gpus_per_worker=1) to make sure each rollout worker can utilize one of your GPUs.

The total number of CPUs being used is always the following (make sure your Ray cluster has that many CPUs available):
(num_rollout_workers * num_cpus_per_worker) + num_cpus_for_local_worker
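
As a quick sanity check, the formula above can be written as a small helper. This is only an illustrative sketch in plain Python: the function name total_cpus is made up here and is not part of RLlib, and the default values of 1 are an assumption about RLlib's defaults. The sample values match the PPOConfig example in this reply (2 rollout workers, 2 CPUs each, 3 CPUs for the local worker).

```python
def total_cpus(num_rollout_workers: int,
               num_cpus_per_worker: float = 1,
               num_cpus_for_local_worker: float = 1) -> float:
    """Hypothetical helper (not an RLlib API): CPUs requested per the formula above."""
    return num_rollout_workers * num_cpus_per_worker + num_cpus_for_local_worker


# 2 remote rollout workers with 2 CPUs each, plus 3 CPUs for the local worker.
print(total_cpus(2, num_cpus_per_worker=2, num_cpus_for_local_worker=3))  # 7
```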

To get an accurate picture of which resources will be requested, you can also call your algorithm class’s default_resource_request method and pass in your config, like so:

from ray.rllib.algorithms.ppo import PPO, PPOConfig

config = PPOConfig().resources(num_gpus=2, num_gpus_per_worker=1, num_cpus_per_worker=2, num_cpus_for_local_worker=3)

resources_needed = PPO.default_resource_request(config)
print(resources_needed)

This will give you the individual “bundles” requested by Ray Tune for a single trial or by RLlib (in case you run an RLlib Algorithm directly, without Tune).
Note that the first bundle is for the local worker; all following bundles are for the individual remote rollout workers.

<PlacementGroupFactory (_bound=<BoundArguments (bundles=[
    {'CPU': 3.0, 'GPU': 2.0},  # local worker
    {'CPU': 2.0, 'GPU': 1.0},  # remote worker 1
    {'CPU': 2.0, 'GPU': 1.0},  # remote worker 2 (PPO by default has 2 rollout workers)
], strategy='PACK')>, head_bundle_is_empty=False)>
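
To relate this back to the original question, here is a rough sketch that sums such bundles and compares the total against what a machine offers. It uses plain dicts copied from the output above rather than any Ray API; the helpers total_resources, fits, and make_bundles are hypothetical names, and the settings in make_bundles (one GPU for the local worker, one GPU per rollout worker) are only a guess at a setup like the one described in the question.

```python
from typing import Dict, List

Bundle = Dict[str, float]


def total_resources(bundles: List[Bundle]) -> Bundle:
    """Sum each resource type over all bundles."""
    totals: Bundle = {}
    for bundle in bundles:
        for name, amount in bundle.items():
            totals[name] = totals.get(name, 0.0) + amount
    return totals


def fits(bundles: List[Bundle], available: Bundle) -> bool:
    """True if the summed request does not exceed the available resources."""
    needed = total_resources(bundles)
    return all(available.get(name, 0.0) >= amount for name, amount in needed.items())


# Bundles copied from the output above: local worker first, then 2 remote workers.
example_bundles = [
    {"CPU": 3.0, "GPU": 2.0},
    {"CPU": 2.0, "GPU": 1.0},
    {"CPU": 2.0, "GPU": 1.0},
]
print(total_resources(example_bundles))  # {'CPU': 7.0, 'GPU': 4.0}


def make_bundles(num_workers: int) -> List[Bundle]:
    """Assumed setup: the local worker and every rollout worker each ask for 1 GPU."""
    local = {"CPU": 1.0, "GPU": 1.0}
    return [local] + [{"CPU": 1.0, "GPU": 1.0} for _ in range(num_workers)]


machine = {"CPU": 16.0, "GPU": 4.0}  # hypothetical 4-GPU machine
print(fits(make_bundles(3), machine))  # True: 1 + 3 = 4 GPUs
print(fits(make_bundles(4), machine))  # False: 1 + 4 = 5 GPUs, so the trial stays pending
```

Under these assumed settings, the local worker's own GPU is what makes 3 rollout workers fill all 4 GPUs.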
