I created a 4-node Ray cluster by running `ray start` on each node. For the head node I passed `--num-cpus=0` and `--num-gpus=0`. For the worker nodes I set `--num-cpus` and `--num-gpus` to the number of physical CPU cores and GPUs on each respective system.
I verified in the logs that the Ray Cluster started correctly.
```
$ ray status
Node status
Healthy:
 1 node_…
 1 node_…
 1 node_…
 1 node_…
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
Usage:
 0.0/84.0 CPU
 0B/674.93GiB memory
 0B/293.25GiB object_store_memory
```
As a basic test I created a remote task that simply gets the node's IP address and returns it. This works exactly as I would expect: the remote tasks are randomly scheduled onto different worker nodes and return their IP addresses, and I see an even distribution of IPs across the worker nodes. The head node is skipped because I told `ray start` not to give it any resources.
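For reference, the smoke test is essentially the following. This is a minimal sketch; the socket-based IP lookup is my own helper, not anything Ray-specific:

```python
import socket

def get_ip() -> str:
    """Return this node's primary IP address (the body of the smoke-test task)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # No packet is actually sent; connect() on a UDP socket just
        # selects the outbound interface, whose address we read back.
        s.connect(("8.8.8.8", 80))
        return s.getsockname()[0]
    except OSError:
        return "127.0.0.1"  # no route available (e.g. offline machine)
    finally:
        s.close()

# On the cluster, wrap the function as a Ray task and fan it out:
#
#   import ray
#   from collections import Counter
#   ray.init(address="auto")
#   get_ip_task = ray.remote(num_cpus=1)(get_ip)
#   print(Counter(ray.get([get_ip_task.remote() for _ in range(100)])))
```

Running the fan-out shown in the comment is what produces the even spread of worker IPs I described.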
The problem is that when I start using RLlib, it only schedules work on the head node and never on the worker nodes. I tested CartPole with PPO and with distributed PPO. It does not matter how I change the number of rollout workers or environments per worker: all CPU utilization appears on the head node while the worker nodes sit idle.
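For reference, the training setup looks roughly like this. It is a sketch using Ray 2.x's `PPOConfig` builder API, and the worker/env counts are placeholders for the values I varied:

```python
from ray.rllib.algorithms.ppo import PPOConfig

# Sketch of the PPO setup; the counts below are placeholders.
config = (
    PPOConfig()
    .environment("CartPole-v1")
    # These are the knobs I varied; changing them had no effect on placement.
    .rollouts(num_rollout_workers=8, num_envs_per_worker=4)
    .resources(num_gpus=0)
)

# On the cluster, after ray.init(address="auto"):
#   algo = config.build()
#   result = algo.train()
```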
From reading the docs, each rollout worker is a remote actor and should be distributed across the worker nodes. They should not be placed on the head node, because I explicitly started it with `--num-cpus=0`. Can someone explain why plain Ray code works as I expect, but RLlib is not working the way the documentation says it should?