Reserve workers on GPU node for trainer workers only

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity
  • Low: It annoys or frustrates me for a moment.
  • Medium: It causes significant difficulty in completing my task, but I can work around it.
  • High: It blocks me from completing my task.

I’m running RLlib experiments on a Slurm cluster, say with 1 CPU-only node and 1 GPU node. It mostly works, but rollout workers get assigned to the GPU node. The problem with this is that the trainer workers also require some amount of CPUs, so Ray cannot create more of them, even though there’s enough GPU memory.
E.g. the status looks like:

== Status ==                                                                                                                                                                                                       
Current time: 2022-04-29 14:04:36 (running for 00:12:48.75)                                                                                                                                                        
Memory usage on this node: 18.1/376.4 GiB                                                                
Using FIFO scheduling algorithm.                    
Resources requested: 22.0/308 CPUs, 0.2/1 GPUs, 0.0/1760.68 GiB heap, 0.0/758.57 GiB objects             
Result logdir: /exps/exp-08/PPO.2022-04-29.13:51:47                                                                                  
Number of trials: 40/40 (36 PENDING, 2 RUNNING, 2 TERMINATED)     

Note, I use 0.1 GPUs per experiment.

So, how can I improve this? Maybe with custom resources, starting the GPU node with a custom resource assigned?

The following applies to most algorithms:
The driver process, which runs the learner thread (the one that computes the gradients), only gets one CPU core. If you don’t have any rollout workers, this CPU core is also used for sampling.

  • Can you post a config?
  • What resources does one of your rollout workers need?
  • What resources do you assign to a learner thread?
  • How many of these experiments do you want to run in parallel?
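
To make the accounting concrete, here is a tiny sketch of how one trial’s CPU footprint adds up under the default settings (1 CPU for the driver/learner plus 1 CPU per rollout worker; the exact numbers depend on `num_cpus_per_worker` and friends, so treat this as an approximation):

```python
def trial_cpus(num_workers, num_cpus_per_worker=1, driver_cpus=1):
    """Approximate CPU footprint of one RLlib trial:
    one driver/learner process plus one process per rollout worker."""
    return driver_cpus + num_workers * num_cpus_per_worker

# With num_workers=5, each trial requests about 6 CPUs.
print(trial_cpus(5))  # 6
```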

Thanks for the reply. I think the situation is a bit different in this case.

E.g. consider the following test script:

import argparse

import ray
from ray import tune

parser = argparse.ArgumentParser()

parser.add_argument("--node-ip", type=str, default="127.0.0.1")
parser.add_argument("--head-ip", type=str)
parser.add_argument("--num-cpus", type=int)


if __name__ == "__main__":
    args = parser.parse_args()

    ray.init(
        address=args.head_ip,
        _node_ip_address=args.node_ip,
    )

    config = {
        "env": "CartPole-v0",
        "num_workers": 5,
        "num_gpus": 0.1,
        "framework": "torch",
    }

    stop = {"training_iteration": 20}

    results = tune.run("PPO", config=config, stop=stop, num_samples=10, verbose=3)
    ray.shutdown()

With the following cluster:

  1. 1x GPU node: 1 GPU, 20 CPUs
  2. 3x CPU nodes: 0 GPU, 48 CPUs each

Tune shows the following report:

== Status ==                             
Current time: 2022-05-03 10:23:30 (running for 00:01:24.91)
Memory usage on this node: 16.8/376.4 GiB                                                                
Using FIFO scheduling algorithm.             
Resources requested: 24.0/164 CPUs, 0.4/1 GPUs, 0.0/997.03 GiB heap, 0.0/431.29 GiB objects
Result logdir: /users/vakker/ray_results/PPO                                                                                                                                                                      
Number of trials: 10/10 (6 PENDING, 4 RUNNING)                                                                                                                                                                     
+-----------------------------+----------+-----------------+--------+------------------+-------+----------+----------------------+----------------------+--------------------+                                     
| Trial name                  | status   | loc             |   iter |   total time (s) |    ts |   reward |   episode_reward_max |   episode_reward_min |   episode_len_mean |                                     
|-----------------------------+----------+-----------------+--------+------------------+-------+----------+----------------------+----------------------+--------------------|                                     
| PPO_CartPole-v0_7fd30_00000 | RUNNING  | 10.12.4.2:28666 |      3 |          27.2387 | 12000 |    62.24 |                  200 |                    9 |              62.24 |                                     
| PPO_CartPole-v0_7fd30_00001 | RUNNING  | 10.12.4.2:28872 |      3 |          26.6867 | 12000 |    62.54 |                  200 |                   10 |              62.54 |                                     
| PPO_CartPole-v0_7fd30_00002 | RUNNING  | 10.12.4.2:28984 |      3 |          26.7506 | 12000 |    66.94 |                  200 |                   12 |              66.94 |                                     
| PPO_CartPole-v0_7fd30_00003 | RUNNING  | 10.12.4.2:29205 |      2 |          19.1286 |  8000 |    40.96 |                  123 |                    9 |              40.96 |                                     
| PPO_CartPole-v0_7fd30_00004 | PENDING  |                 |        |                  |       |          |                      |                      |                    |                                     
| PPO_CartPole-v0_7fd30_00005 | PENDING  |                 |        |                  |       |          |                      |                      |                    |                                     
| PPO_CartPole-v0_7fd30_00006 | PENDING  |                 |        |                  |       |          |                      |                      |                    |                                     
| PPO_CartPole-v0_7fd30_00007 | PENDING  |                 |        |                  |       |          |                      |                      |                    |                                     
| PPO_CartPole-v0_7fd30_00008 | PENDING  |                 |        |                  |       |          |                      |                      |                    |                                     
| PPO_CartPole-v0_7fd30_00009 | PENDING  |                 |        |                  |       |          |                      |                      |                    |
+-----------------------------+----------+-----------------+--------+------------------+-------+----------+----------------------+----------------------+--------------------+                                             

The reported resources are correct. But if I check the GPU node, all the rollout workers are running there, so they take up all the available CPUs on that node. As a result, there are only 4 parallel trials and no further trainers get scheduled, even though there would be enough GPU memory to handle more:
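
The arithmetic behind this (as far as I understand the default PACK behaviour, which tries to co-locate a trial’s bundles with its GPU bundle): the GPU node’s 20 CPUs become the bottleneck long before its single GPU does at 0.1 GPU per trial. A rough sketch with the numbers from the cluster above:

```python
gpu_node_cpus = 20
cpus_per_trial = 1 + 5   # trainer driver + 5 rollout workers
gpus_per_trial = 0.1

# CPU-bound limit if all of a trial's bundles are packed onto the GPU node:
max_trials_cpu = gpu_node_cpus // cpus_per_trial
# GPU-bound limit:
max_trials_gpu = round(1 / gpus_per_trial)

print(max_trials_cpu, max_trials_gpu)  # 3 10
```

So the GPU node caps out at roughly 3-4 trials on CPUs, while GPU memory alone would allow 10.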

vakker   30923 86.6  1.1 124837412 4442412 ?   Rl   10:26   1:19 ray::PPOTrainer.train()    
vakker   30957 86.9  1.1 124835428 4441068 ?   Rl   10:26   1:18 ray::PPOTrainer.train()
vakker   30961 86.7  1.1 124763444 4440652 ?   Dl   10:26   1:17 ray::PPOTrainer.train()
vakker   30965 13.3  0.1 114018324 410300 ?    Sl   10:26   0:11 ray::RolloutWorker
vakker   30966 13.3  0.1 114018360 406728 ?    Sl   10:26   0:11 ray::RolloutWorker
vakker   30967 13.1  0.1 114018080 410424 ?    Sl   10:26   0:11 ray::RolloutWorker
vakker   30968 13.0  0.1 114018104 408372 ?    Sl   10:26   0:11 ray::RolloutWorker   
vakker   30969 13.3  0.1 114018260 406696 ?    Sl   10:26   0:11 ray::RolloutWorker
vakker   30996 85.3  1.1 124761628 4436840 ?   Rl   10:26   1:15 ray::PPOTrainer.train()
vakker   31191 12.8  0.1 114018072 407240 ?    Sl   10:26   0:11 ray::RolloutWorker
vakker   31192 12.6  0.1 114018180 412952 ?    Sl   10:26   0:10 ray::RolloutWorker
vakker   31193 12.6  0.1 114018328 407008 ?    Sl   10:26   0:10 ray::RolloutWorker
vakker   31194 12.7  0.1 114018032 408576 ?    Sl   10:26   0:10 ray::RolloutWorker
vakker   31195 12.6  0.1 114018004 407324 ?    Sl   10:26   0:10 ray::RolloutWorker
vakker   31203 12.9  0.1 114018252 410816 ?    Sl   10:26   0:11 ray::RolloutWorker
vakker   31204 12.5  0.1 114018216 408820 ?    Sl   10:26   0:10 ray::RolloutWorker
vakker   31205 12.9  0.1 114018068 408968 ?    Sl   10:26   0:11 ray::RolloutWorker
vakker   31206 12.6  0.1 114018216 408876 ?    Sl   10:26   0:10 ray::RolloutWorker
vakker   31207 12.8  0.1 114017920 413076 ?    Sl   10:26   0:10 ray::RolloutWorker
vakker   31349 13.0  0.1 114018328 408244 ?    Sl   10:26   0:11 ray::RolloutWorker

So, the question is how to make sure that there are only ray::PPOTrainer processes scheduled on the GPU node and no ray::RolloutWorker processes.

What happens if you choose
{… "placement_strategy": "STRICT_SPREAD" … }?

@arturn thanks! I was just typing in what I found, see below.

===============================

I did a bit of digging, and it seems like there are 2 solutions:

  1. using the placement strategy (based on this):
    config = {
        "env": "CartPole-v0",
        "num_workers": 5,
        "num_gpus": 0.1,
        "framework": "torch",
        "placement_strategy": "SPREAD",
    }

Which results in:

== Status ==                                
Current time: 2022-05-03 14:35:46 (running for 00:09:35.54)
Memory usage on this node: 27.6/376.4 GiB       
Using FIFO scheduling algorithm.    
Resources requested: 60.0/164 CPUs, 0.9999999999999999/1 GPUs, 0.0/997.27 GiB heap, 0.0/431.39 GiB objects
Result logdir: /users/vakker/ray_results/PPO
Number of trials: 10/10 (10 RUNNING)       
+-----------------------------+----------+------------------+--------+------------------+-------+----------+----------------------+----------------------+--------------------+
| Trial name                  | status   | loc              |   iter |   total time (s) |    ts |   reward |   episode_reward_max |   episode_reward_min |   episode_len_mean |
|-----------------------------+----------+------------------+--------+------------------+-------+----------+----------------------+----------------------+--------------------|
| PPO_CartPole-v0_98d5c_00000 | RUNNING  | 10.12.4.2:205799 |      6 |          94.9154 | 24000 |   154.21 |                  200 |                   18 |             154.21 |
| PPO_CartPole-v0_98d5c_00001 | RUNNING  | 10.12.4.2:206484 |      6 |          96.0914 | 24000 |   143.6  |                  200 |                   11 |             143.6  |
| PPO_CartPole-v0_98d5c_00002 | RUNNING  | 10.12.4.2:206661 |      6 |          92.1264 | 24000 |   154.57 |                  200 |                   14 |             154.57 |
| PPO_CartPole-v0_98d5c_00003 | RUNNING  | 10.12.4.2:206946 |      6 |          91.6675 | 24000 |   156.25 |                  200 |                   14 |             156.25 |
| PPO_CartPole-v0_98d5c_00004 | RUNNING  | 10.12.4.2:207214 |      6 |          93.2915 | 24000 |   156.15 |                  200 |                   11 |             156.15 |
| PPO_CartPole-v0_98d5c_00005 | RUNNING  | 10.12.4.2:207494 |      6 |          91.9662 | 24000 |   150.02 |                  200 |                    9 |             150.02 |
| PPO_CartPole-v0_98d5c_00006 | RUNNING  | 10.12.4.2:207790 |      6 |          92.9504 | 24000 |   148.21 |                  200 |                   16 |             148.21 |
| PPO_CartPole-v0_98d5c_00007 | RUNNING  | 10.12.4.2:208130 |      6 |          93.599  | 24000 |   152.99 |                  200 |                   20 |             152.99 |
| PPO_CartPole-v0_98d5c_00008 | RUNNING  | 10.12.4.2:208478 |      6 |          93.0761 | 24000 |   154.65 |                  200 |                   15 |             154.65 |
| PPO_CartPole-v0_98d5c_00009 | RUNNING  | 10.12.4.2:208818 |      4 |          57.9805 | 16000 |    97.44 |                  200 |                    9 |              97.44 |
+-----------------------------+----------+------------------+--------+------------------+-------+----------+----------------------+----------------------+--------------------+

And only ray::PPOTrainer processes run on the GPU node. However, this doesn’t provide a hard constraint (if I understand it correctly), so if a new rollout worker can be placed either on a CPU node or on the GPU node, it might still end up on the GPU node.
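
For context, each trial seems to be turned into a placement group shaped roughly like this (one driver bundle holding the GPU fraction, plus one bundle per rollout worker; this matches the bundle list Ray prints in its placement-group errors), and `placement_strategy` only changes how these bundles are mapped to nodes:

```python
num_workers = 5

# Driver/learner bundle (holds the GPU fraction) + one bundle per rollout worker.
bundles = [{"CPU": 1, "GPU": 0.1}] + [{"CPU": 1} for _ in range(num_workers)]

# "PACK" (the default) tries to co-locate bundles on as few nodes as possible,
# "SPREAD" spreads them across nodes on a best-effort basis, and
# "STRICT_SPREAD" requires one distinct node per bundle.
print(bundles)
```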

I would assume STRICT_SPREAD would get around this. However, with STRICT_SPREAD I would need 1+5 nodes for the trainer and the 5 rollout workers, and even though there are more than enough CPUs, nothing gets scheduled:

== Status ==
Current time: 2022-05-03 15:59:38 (running for 00:01:10.58)
Memory usage on this node: 22.1/376.4 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/164 CPUs, 0/1 GPUs, 0.0/994.07 GiB heap, 0.0/430.02 GiB objects
Result logdir: /users/vakker/ray_results/PPO
Number of trials: 10/10 (10 PENDING)
+-----------------------------+----------+-------+
| Trial name                  | status   | loc   |
|-----------------------------+----------+-------|
| PPO_CartPole-v0_7d032_00000 | PENDING  |       |
| PPO_CartPole-v0_7d032_00001 | PENDING  |       |
| PPO_CartPole-v0_7d032_00002 | PENDING  |       |
| PPO_CartPole-v0_7d032_00003 | PENDING  |       |
| PPO_CartPole-v0_7d032_00004 | PENDING  |       |
| PPO_CartPole-v0_7d032_00005 | PENDING  |       |
| PPO_CartPole-v0_7d032_00006 | PENDING  |       |
| PPO_CartPole-v0_7d032_00007 | PENDING  |       |
| PPO_CartPole-v0_7d032_00008 | PENDING  |       |
| PPO_CartPole-v0_7d032_00009 | PENDING  |       |
+-----------------------------+----------+-------+
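
That matches my reading of the STRICT_SPREAD semantics: every bundle must land on a distinct node, so a trial with 6 bundles can never be placed on a 4-node cluster, regardless of how many CPUs are free. A quick sanity check:

```python
num_nodes = 4              # 1 GPU node + 3 CPU nodes
bundles_per_trial = 1 + 5  # driver bundle + one bundle per rollout worker

# STRICT_SPREAD needs a distinct node for every bundle of the trial.
feasible = bundles_per_trial <= num_nodes
print(feasible)  # False
```
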
  2. using custom resources (based on this):
    config = {
        "env": "CartPole-v0",
        "num_workers": 5,
        "num_gpus": 0.1,
        "framework": "torch",
        "custom_resources_per_worker": {"NO-GPU": 0.00001},
    }

And launch the CPU-only workers with

ray start <...> --resources='{"NO-GPU": 1}'

However, this gives the following error:

(PPOTrainer pid=220542, ip=10.10.4.2) 2022-05-03 15:00:20,276   ERROR worker.py:452 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::PPOTrainer.__init__() (pid=220542, ip=10.12.4.2, repr=PPOTrainer)
(PPOTrainer pid=220542, ip=10.10.4.2)   File "/usr/local/lib/python3.9/site-packages/ray/rllib/agents/trainer.py", line 1074, in _init                                                                             
(PPOTrainer pid=220542, ip=10.10.4.2)     raise NotImplementedError                                      
(PPOTrainer pid=220542, ip=10.10.4.2) NotImplementedError                                                                                                                                                          
(PPOTrainer pid=220542, ip=10.10.4.2)                                                                    
(PPOTrainer pid=220542, ip=10.10.4.2) During handling of the above exception, another exception occurred:
(PPOTrainer pid=220542, ip=10.10.4.2)
(PPOTrainer pid=220542, ip=10.10.4.2) ray::PPOTrainer.__init__() (pid=220542, ip=10.12.4.2, repr=PPOTrainer)
(PPOTrainer pid=220542, ip=10.10.4.2)   File "/usr/local/lib/python3.9/site-packages/ray/rllib/agents/trainer.py", line 870, in __init__                                                                           
(PPOTrainer pid=220542, ip=10.10.4.2)     super().__init__(                                              
(PPOTrainer pid=220542, ip=10.10.4.2)   File "/usr/local/lib/python3.9/site-packages/ray/tune/trainable.py", line 156, in __init__
(PPOTrainer pid=220542, ip=10.10.4.2)     self.setup(copy.deepcopy(self.config))                                                                                                                                   
(PPOTrainer pid=220542, ip=10.10.4.2)   File "/usr/local/lib/python3.9/site-packages/ray/rllib/agents/trainer.py", line 950, in setup
(PPOTrainer pid=220542, ip=10.10.4.2)     self.workers = WorkerSet(
(PPOTrainer pid=220542, ip=10.10.4.2)   File "/usr/local/lib/python3.9/site-packages/ray/rllib/evaluation/worker_set.py", line 127, in __init__
(PPOTrainer pid=220542, ip=10.10.4.2)     self.add_workers(num_workers)                                                                                                                                            
(PPOTrainer pid=220542, ip=10.10.4.2)   File "/usr/local/lib/python3.9/site-packages/ray/rllib/evaluation/worker_set.py", line 240, in add_workers
(PPOTrainer pid=220542, ip=10.10.4.2)     [                                                                                                                                                                        
(PPOTrainer pid=220542, ip=10.10.4.2)   File "/usr/local/lib/python3.9/site-packages/ray/rllib/evaluation/worker_set.py", line 241, in <listcomp>
(PPOTrainer pid=220542, ip=10.10.4.2)     self._make_worker(                                                                                                                                                       
(PPOTrainer pid=220542, ip=10.10.4.2)   File "/usr/local/lib/python3.9/site-packages/ray/rllib/evaluation/worker_set.py", line 629, in _make_worker
(PPOTrainer pid=220542, ip=10.10.4.2)     worker = cls(                                                                                                                                                            
(PPOTrainer pid=220542, ip=10.10.4.2)   File "/usr/local/lib/python3.9/site-packages/ray/actor.py", line 522, in remote                    
(PPOTrainer pid=220542, ip=10.10.4.2)     return self._remote(args=args, kwargs=kwargs, **self._default_options)                           
(PPOTrainer pid=220542, ip=10.10.4.2)   File "/usr/local/lib/python3.9/site-packages/ray/actor.py", line 839, in _remote                   
(PPOTrainer pid=220542, ip=10.10.4.2)     placement_group = configure_placement_group_based_on_context(
(PPOTrainer pid=220542, ip=10.10.4.2)   File "/usr/local/lib/python3.9/site-packages/ray/util/placement_group.py", line 417, in configure_placement_group_based_on_context                                         
(PPOTrainer pid=220542, ip=10.10.4.2)     _validate_resource_shape(
(PPOTrainer pid=220542, ip=10.10.4.2)   File "/usr/local/lib/python3.9/site-packages/ray/util/placement_group.py", line 332, in _validate_resource_shape
(PPOTrainer pid=220542, ip=10.10.4.2)     raise ValueError(        
(PPOTrainer pid=220542, ip=10.10.4.2) ValueError: Cannot schedule RolloutWorker with the placement group because the resource request {'NO-GPU': 1e-05, 'CPU': 1, 'GPU': 0} cannot fit into any bundles for the placement group, [{'CPU': 1.0, 'GPU': 0.1}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}].

Even though it does recognise the dummy resource in the status:

== Status ==                                                                                                                                                                                                       
Current time: 2022-05-03 15:00:20 (running for 00:00:19.21)                                                                                                                                                        
Memory usage on this node: 22.1/376.4 GiB                                                                                                                                                                          
Using FIFO scheduling algorithm.                                                                                                                                                                                   
Resources requested: 0/164 CPUs, 0/1 GPUs, 0.0/994.26 GiB heap, 0.0/430.1 GiB objects (0.0/2.0 NO-GPU)                                                                                                             
Result logdir: /users/vakker/ray_results/PPO                                                                                                                                                                      
Number of trials: 10/10 (10 ERROR)                                                                                                                                                                                 
+-----------------------------+----------+-------+                                                                                                                                                                 
| Trial name                  | status   | loc   |                                                                                                                                                                 
|-----------------------------+----------+-------|                                                                                                                                                                 
| PPO_CartPole-v0_5306e_00000 | ERROR    |       |                                                                                                                                                                 
| PPO_CartPole-v0_5306e_00001 | ERROR    |       |                                                       
| PPO_CartPole-v0_5306e_00002 | ERROR    |       |                                                                                                                                                                 
| PPO_CartPole-v0_5306e_00003 | ERROR    |       |                                                       
| PPO_CartPole-v0_5306e_00004 | ERROR    |       |                                                                                                                                                                 
| PPO_CartPole-v0_5306e_00005 | ERROR    |       |                                                       
| PPO_CartPole-v0_5306e_00006 | ERROR    |       |                                                                                                                                                                 
| PPO_CartPole-v0_5306e_00007 | ERROR    |       |                                                                                                                                                                 
| PPO_CartPole-v0_5306e_00008 | ERROR    |       |                                                                                                                                                                 
| PPO_CartPole-v0_5306e_00009 | ERROR    |       |                                                                                                                                                                 
+-----------------------------+----------+-------+   

Is this a bug or am I supposed to change something else to make this work?

This seems to come from the fact that custom worker resources aren’t yet properly implemented for Tune’s placement groups, see this. Would that be trivial to add? E.g. something like:

            [
                {
                    # RolloutWorkers.
                    "CPU": cf["num_cpus_per_worker"],
                    "GPU": cf["num_gpus_per_worker"],
                    **cf["custom_resources_per_worker"],
                }
                for _ in range(cf["num_workers"])
            ]
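
For illustration, with the config from above that change would make each worker bundle carry the custom resource as well, so the rollout worker’s request would fit (hypothetical values; I’m only showing the resulting dict shape, not the actual RLlib code path):

```python
# Hypothetical config values mirroring my experiment above.
cf = {
    "num_workers": 5,
    "num_cpus_per_worker": 1,
    "num_gpus_per_worker": 0,
    "custom_resources_per_worker": {"NO-GPU": 0.00001},
}

# Worker bundles with the custom resources merged in, as in the snippet above.
worker_bundles = [
    {
        "CPU": cf["num_cpus_per_worker"],
        "GPU": cf["num_gpus_per_worker"],
        **cf["custom_resources_per_worker"],
    }
    for _ in range(cf["num_workers"])
]
print(worker_bundles[0])  # {'CPU': 1, 'GPU': 0, 'NO-GPU': 1e-05}
```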

Thanks!