I created a 4-node Ray cluster by running `ray start` on each node. For the head node I passed `--num-cpus=0` and `--num-gpus=0`. For the worker nodes I set `--num-cpus` and `--num-gpus` to the number of physical CPU cores and GPUs on each respective system.
I verified in the logs that the Ray Cluster started correctly.
```
$ ray status
Node status
Healthy:
 1 node_…
 1 node_…
 1 node_…
 1 node_…
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
Usage:
 0.0/84.0 CPU
 0B/674.93GiB memory
 0B/293.25GiB object_store_memory
```
As a basic test I created a remote task that simply gets the node's IP address and returns it. This works exactly as I would expect: the remote tasks are randomly scheduled onto different worker nodes and return their IP addresses, and I see an even distribution of IPs across the worker nodes. The head node is skipped because I told `ray start` not to give it any resources.
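For reference, the smoke test is essentially the following. This is a minimal sketch; the socket-based IP lookup is my own helper, not anything Ray-specific:

```python
import socket

def get_ip() -> str:
    """Return this node's primary IP address (the body of the smoke-test task)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # No packet is actually sent; connect() on a UDP socket just
        # selects the outbound interface, whose address we read back.
        s.connect(("8.8.8.8", 80))
        return s.getsockname()[0]
    except OSError:
        return "127.0.0.1"  # no route available (e.g. offline machine)
    finally:
        s.close()

# On the cluster, wrap the function as a Ray task and fan it out:
#
#   import ray
#   from collections import Counter
#   ray.init(address="auto")
#   get_ip_task = ray.remote(num_cpus=1)(get_ip)
#   print(Counter(ray.get([get_ip_task.remote() for _ in range(100)])))
```

Running the fan-out shown in the comment is what produces the even spread of worker IPs I described.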
The problem is that when I start using RLlib, it only schedules work on the head node and never on the worker nodes. I tested CartPole with PPO and with distributed PPO. It does not matter how I change the number of rollout workers or environments per worker: all CPU utilization appears on the head node while the worker nodes sit idle.
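For reference, the training setup looks roughly like this. It is a sketch using Ray 2.x's `PPOConfig` builder API, and the worker/env counts are placeholders for the values I varied:

```python
from ray.rllib.algorithms.ppo import PPOConfig

# Sketch of the PPO setup; the counts below are placeholders.
config = (
    PPOConfig()
    .environment("CartPole-v1")
    # These are the knobs I varied; changing them had no effect on placement.
    .rollouts(num_rollout_workers=8, num_envs_per_worker=4)
    .resources(num_gpus=0)
)

# On the cluster, after ray.init(address="auto"):
#   algo = config.build()
#   result = algo.train()
```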
From reading the docs, each rollout worker is a remote actor and should be distributed across the worker nodes. They should not be placed on the head node, because I explicitly started it with `--num-cpus=0`. Can someone explain why plain Ray code works as I expect, but RLlib is not working the way the documentation says it should?