@arturn thanks! I was just typing in what I found, see below.
===============================
I did a bit of digging, and it seems like there are 2 solutions:
- using the placement strategy (based on this):
config = {
    "env": "CartPole-v0",
    "num_workers": 5,
    "num_gpus": 0.1,
    "framework": "torch",
    "placement_strategy": "SPREAD",
}
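(For context, the trials are launched with something along these lines; this is a minimal sketch, assuming num_samples=10 to produce the 10 trials shown in the status output.)

from ray import tune

tune.run(
    "PPO",
    config=config,  # the config dict defined above
    num_samples=10,  # assumption: matches the 10 trials shown below
)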
Which results in:
== Status ==
Current time: 2022-05-03 14:35:46 (running for 00:09:35.54)
Memory usage on this node: 27.6/376.4 GiB
Using FIFO scheduling algorithm.
Resources requested: 60.0/164 CPUs, 0.9999999999999999/1 GPUs, 0.0/997.27 GiB heap, 0.0/431.39 GiB objects
Result logdir: /users/vakker/ray_results/PPO
Number of trials: 10/10 (10 RUNNING)
+-----------------------------+----------+------------------+--------+------------------+-------+----------+----------------------+----------------------+--------------------+
| Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|-----------------------------+----------+------------------+--------+------------------+-------+----------+----------------------+----------------------+--------------------|
| PPO_CartPole-v0_98d5c_00000 | RUNNING | 10.12.4.2:205799 | 6 | 94.9154 | 24000 | 154.21 | 200 | 18 | 154.21 |
| PPO_CartPole-v0_98d5c_00001 | RUNNING | 10.12.4.2:206484 | 6 | 96.0914 | 24000 | 143.6 | 200 | 11 | 143.6 |
| PPO_CartPole-v0_98d5c_00002 | RUNNING | 10.12.4.2:206661 | 6 | 92.1264 | 24000 | 154.57 | 200 | 14 | 154.57 |
| PPO_CartPole-v0_98d5c_00003 | RUNNING | 10.12.4.2:206946 | 6 | 91.6675 | 24000 | 156.25 | 200 | 14 | 156.25 |
| PPO_CartPole-v0_98d5c_00004 | RUNNING | 10.12.4.2:207214 | 6 | 93.2915 | 24000 | 156.15 | 200 | 11 | 156.15 |
| PPO_CartPole-v0_98d5c_00005 | RUNNING | 10.12.4.2:207494 | 6 | 91.9662 | 24000 | 150.02 | 200 | 9 | 150.02 |
| PPO_CartPole-v0_98d5c_00006 | RUNNING | 10.12.4.2:207790 | 6 | 92.9504 | 24000 | 148.21 | 200 | 16 | 148.21 |
| PPO_CartPole-v0_98d5c_00007 | RUNNING | 10.12.4.2:208130 | 6 | 93.599 | 24000 | 152.99 | 200 | 20 | 152.99 |
| PPO_CartPole-v0_98d5c_00008 | RUNNING | 10.12.4.2:208478 | 6 | 93.0761 | 24000 | 154.65 | 200 | 15 | 154.65 |
| PPO_CartPole-v0_98d5c_00009 | RUNNING | 10.12.4.2:208818 | 4 | 57.9805 | 16000 | 97.44 | 200 | 9 | 97.44 |
+-----------------------------+----------+------------------+--------+------------------+-------+----------+----------------------+----------------------+--------------------+
And only ray::PPOTrainer processes end up on the GPU node. However, this doesn’t provide a hard constraint (if I understand it correctly), so if a new rollout worker can be placed either on a CPU node or on the GPU node, it might still land on the GPU node.
I would assume STRICT_SPREAD would get around this. However, with STRICT_SPREAD I would need 1+5 nodes for the trainer and the 5 rollout workers, and even though there are more than enough CPUs, nothing gets scheduled (see the sketch after the status output below):
== Status ==
Current time: 2022-05-03 15:59:38 (running for 00:01:10.58)
Memory usage on this node: 22.1/376.4 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/164 CPUs, 0/1 GPUs, 0.0/994.07 GiB heap, 0.0/430.02 GiB objects
Result logdir: /users/vakker/ray_results/PPO
Number of trials: 10/10 (10 PENDING)
+-----------------------------+----------+-------+
| Trial name | status | loc |
|-----------------------------+----------+-------|
| PPO_CartPole-v0_7d032_00000 | PENDING | |
| PPO_CartPole-v0_7d032_00001 | PENDING | |
| PPO_CartPole-v0_7d032_00002 | PENDING | |
| PPO_CartPole-v0_7d032_00003 | PENDING | |
| PPO_CartPole-v0_7d032_00004 | PENDING | |
| PPO_CartPole-v0_7d032_00005 | PENDING | |
| PPO_CartPole-v0_7d032_00006 | PENDING | |
| PPO_CartPole-v0_7d032_00007 | PENDING | |
| PPO_CartPole-v0_7d032_00008 | PENDING | |
| PPO_CartPole-v0_7d032_00009 | PENDING | |
+-----------------------------+----------+-------+
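As far as I understand, this is the expected behaviour of STRICT_SPREAD: every bundle has to be placed on a different node. A minimal sketch with the raw placement group API (the bundle shapes are my guess at what RLlib requests for this config, i.e. 1 trainer bundle plus 5 worker bundles):

import ray
from ray.util.placement_group import placement_group

ray.init(address="auto")

# One bundle for the trainer plus one per rollout worker. With STRICT_SPREAD
# every bundle must land on a *different* node, so 6 nodes are needed even if
# a single node has plenty of spare CPUs.
pg = placement_group(
    bundles=[{"CPU": 1, "GPU": 0.1}] + [{"CPU": 1}] * 5,
    strategy="STRICT_SPREAD",
)
# On a cluster with fewer than 6 nodes this never becomes ready, which matches
# the PENDING trials above.
ray.get(pg.ready())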
- using custom resources (based on this):
config = {
    "env": "CartPole-v0",
    "num_workers": 5,
    "num_gpus": 0.1,
    "framework": "torch",
    "custom_resources_per_worker": {"NO-GPU": 0.00001},
}
And launch the CPU-only worker nodes with:
ray start <...> --resources='{"NO-GPU": 1}'
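A quick sanity check (sketch, assuming two CPU-only nodes were started this way, which matches the 0.0/2.0 NO-GPU line further down):

import ray

ray.init(address="auto")

# The custom resource advertised by the CPU-only nodes should show up here,
# e.g. {'NO-GPU': 2.0, ...} with two such nodes.
print(ray.cluster_resources())
print(ray.available_resources())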
However, this gives the following error:
(PPOTrainer pid=220542, ip=10.10.4.2) 2022-05-03 15:00:20,276 ERROR worker.py:452 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::PPOTrainer.__init__() (pid=220542, ip=10.12.4.2, repr=PPOTrainer)
(PPOTrainer pid=220542, ip=10.10.4.2) File "/usr/local/lib/python3.9/site-packages/ray/rllib/agents/trainer.py", line 1074, in _init
(PPOTrainer pid=220542, ip=10.10.4.2) raise NotImplementedError
(PPOTrainer pid=220542, ip=10.10.4.2) NotImplementedError
(PPOTrainer pid=220542, ip=10.10.4.2)
(PPOTrainer pid=220542, ip=10.10.4.2) During handling of the above exception, another exception occurred:
(PPOTrainer pid=220542, ip=10.10.4.2)
(PPOTrainer pid=220542, ip=10.10.4.2) ray::PPOTrainer.__init__() (pid=220542, ip=10.12.4.2, repr=PPOTrainer)
(PPOTrainer pid=220542, ip=10.10.4.2) File "/usr/local/lib/python3.9/site-packages/ray/rllib/agents/trainer.py", line 870, in __init__
(PPOTrainer pid=220542, ip=10.10.4.2) super().__init__(
(PPOTrainer pid=220542, ip=10.10.4.2) File "/usr/local/lib/python3.9/site-packages/ray/tune/trainable.py", line 156, in __init__
(PPOTrainer pid=220542, ip=10.10.4.2) self.setup(copy.deepcopy(self.config))
(PPOTrainer pid=220542, ip=10.10.4.2) File "/usr/local/lib/python3.9/site-packages/ray/rllib/agents/trainer.py", line 950, in setup
(PPOTrainer pid=220542, ip=10.10.4.2) self.workers = WorkerSet(
(PPOTrainer pid=220542, ip=10.10.4.2) File "/usr/local/lib/python3.9/site-packages/ray/rllib/evaluation/worker_set.py", line 127, in __init__
(PPOTrainer pid=220542, ip=10.10.4.2) self.add_workers(num_workers)
(PPOTrainer pid=220542, ip=10.10.4.2) File "/usr/local/lib/python3.9/site-packages/ray/rllib/evaluation/worker_set.py", line 240, in add_workers
(PPOTrainer pid=220542, ip=10.10.4.2) [
(PPOTrainer pid=220542, ip=10.10.4.2) File "/usr/local/lib/python3.9/site-packages/ray/rllib/evaluation/worker_set.py", line 241, in <listcomp>
(PPOTrainer pid=220542, ip=10.10.4.2) self._make_worker(
(PPOTrainer pid=220542, ip=10.10.4.2) File "/usr/local/lib/python3.9/site-packages/ray/rllib/evaluation/worker_set.py", line 629, in _make_worker
(PPOTrainer pid=220542, ip=10.10.4.2) worker = cls(
(PPOTrainer pid=220542, ip=10.10.4.2) File "/usr/local/lib/python3.9/site-packages/ray/actor.py", line 522, in remote
(PPOTrainer pid=220542, ip=10.10.4.2) return self._remote(args=args, kwargs=kwargs, **self._default_options)
(PPOTrainer pid=220542, ip=10.10.4.2) File "/usr/local/lib/python3.9/site-packages/ray/actor.py", line 839, in _remote
(PPOTrainer pid=220542, ip=10.10.4.2) placement_group = configure_placement_group_based_on_context(
(PPOTrainer pid=220542, ip=10.10.4.2) File "/usr/local/lib/python3.9/site-packages/ray/util/placement_group.py", line 417, in configure_placement_group_based_on_context
(PPOTrainer pid=220542, ip=10.10.4.2) _validate_resource_shape(
(PPOTrainer pid=220542, ip=10.10.4.2) File "/usr/local/lib/python3.9/site-packages/ray/util/placement_group.py", line 332, in _validate_resource_shape
(PPOTrainer pid=220542, ip=10.10.4.2) raise ValueError(
(PPOTrainer pid=220542, ip=10.10.4.2) ValueError: Cannot schedule RolloutWorker with the placement group because the resource request {'NO-GPU': 1e-05, 'CPU': 1, 'GPU': 0} cannot fit into any bundles for the placement group, [{'CPU': 1.0, 'GPU': 0.1}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}].
Even though it does recognise the dummy resource in the status:
== Status ==
Current time: 2022-05-03 15:00:20 (running for 00:00:19.21)
Memory usage on this node: 22.1/376.4 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/164 CPUs, 0/1 GPUs, 0.0/994.26 GiB heap, 0.0/430.1 GiB objects (0.0/2.0 NO-GPU)
Result logdir: /users/vakker/ray_results/PPO
Number of trials: 10/10 (10 ERROR)
+-----------------------------+----------+-------+
| Trial name | status | loc |
|-----------------------------+----------+-------|
| PPO_CartPole-v0_5306e_00000 | ERROR | |
| PPO_CartPole-v0_5306e_00001 | ERROR | |
| PPO_CartPole-v0_5306e_00002 | ERROR | |
| PPO_CartPole-v0_5306e_00003 | ERROR | |
| PPO_CartPole-v0_5306e_00004 | ERROR | |
| PPO_CartPole-v0_5306e_00005 | ERROR | |
| PPO_CartPole-v0_5306e_00006 | ERROR | |
| PPO_CartPole-v0_5306e_00007 | ERROR | |
| PPO_CartPole-v0_5306e_00008 | ERROR | |
| PPO_CartPole-v0_5306e_00009 | ERROR | |
+-----------------------------+----------+-------+
Is this a bug or am I supposed to change something else to make this work?
This seems to come from the fact that custom_resources_per_worker isn’t yet included in the placement group bundles that RLlib requests through Tune, see this. Is that trivial to add? E.g. something like:
[
    {
        # RolloutWorkers.
        "CPU": cf["num_cpus_per_worker"],
        "GPU": cf["num_gpus_per_worker"],
        **cf["custom_resources_per_worker"],
    }
    for _ in range(cf["num_workers"])
]
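In the meantime, a possible workaround might be to subclass PPOTrainer and override default_resource_request so the worker bundles carry the custom resources. This is an untested sketch: the DEFAULT_CONFIG merge, the _fake_gpus check and the key names mirror what trainer.py does as far as I can tell, and the evaluation-worker bundles are left out for brevity.

from ray.rllib.agents.ppo import PPOTrainer, DEFAULT_CONFIG
from ray.tune import PlacementGroupFactory


class PatchedPPOTrainer(PPOTrainer):
    @classmethod
    def default_resource_request(cls, config):
        # Merge the user config onto PPO's defaults, as trainer.py does.
        cf = dict(DEFAULT_CONFIG, **config)
        return PlacementGroupFactory(
            bundles=[
                {
                    # Driver (the PPOTrainer actor itself).
                    "CPU": cf["num_cpus_for_driver"],
                    "GPU": 0 if cf["_fake_gpus"] else cf["num_gpus"],
                }
            ]
            + [
                {
                    # RolloutWorkers, now including the custom resources.
                    "CPU": cf["num_cpus_per_worker"],
                    "GPU": cf["num_gpus_per_worker"],
                    **cf["custom_resources_per_worker"],
                }
                for _ in range(cf["num_workers"])
            ],
            strategy=config.get("placement_strategy", "PACK"),
        )

PatchedPPOTrainer can then be passed to tune.run() in place of the "PPO" string.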
Thanks!