How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hey! I’m trying to train PPO on a computationally expensive environment that needs to run on GPU.
I’m running a Ray cluster across multiple (e.g., 8) nodes with 8 GPUs each using Slurm.
How can I get PPO to use the available resources efficiently?
On a single node with 8 GPUs, it's straightforward to split resources across the driver and workers with the following settings (see the sketch after this list):
- num_workers (e.g., 35)
- num_gpus (e.g., 1)
- num_gpus_per_worker (e.g., 0.2)
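Concretely, the single-node run looks roughly like this (the environment name and framework are placeholders for my actual setup, and I'm using the classic config-dict / `tune.run` API):

```python
import ray
from ray import tune

ray.init()

config = {
    "env": "MyExpensiveGPUEnv",   # placeholder for my custom GPU-bound env
    "framework": "torch",
    "num_workers": 35,            # rollout workers
    "num_gpus": 1,                # GPU for the driver/learner
    "num_gpus_per_worker": 0.2,   # fractional GPU per rollout worker
}

# 35 workers * 0.2 GPU + 1 driver GPU = 8 GPUs total on the node
tune.run("PPO", config=config)
```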
But when I try to scale this to 8 nodes of 8 GPUs with the following settings:
- num_workers = 8 * 35 = 280
- num_gpus = 8 * 1 = 8
- num_gpus_per_worker = 0.2
the rollout workers no longer get any GPUs (see the second sketch below).
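For reference, the multi-node attempt is essentially the same sketch scaled up (again with placeholder names), connecting to the Ray cluster already launched through Slurm:

```python
import ray
from ray import tune

ray.init(address="auto")  # attach to the existing Slurm-launched Ray cluster

config = {
    "env": "MyExpensiveGPUEnv",   # placeholder for my custom GPU-bound env
    "framework": "torch",
    "num_workers": 8 * 35,        # 280 rollout workers across the cluster
    "num_gpus": 8,                # GPUs requested for the driver/learner
    "num_gpus_per_worker": 0.2,
}

# Expected: 280 * 0.2 = 56 worker GPUs + 8 driver GPUs = 64 GPUs in total,
# but in practice the rollout workers end up without GPU access.
tune.run("PPO", config=config)
```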
Can we run PPO on multiple nodes? What is the right way to set this up? Is DDPPO the only option, or can vanilla PPO work across multiple nodes?