Multiple Ray Head Nodes

xzf0kgb0bqr.cev2RWU · January 18, 2023, 3:56am

Low: It annoys or frustrates me for a moment.

I am using 2 nodes and 2 folders. I want to create 2 processes with Slurm, one to go into each folder. So something like:

srun [run python script using one node] &
srun [run same python script using other node] &
wait

So far so good.

Question

Each process I want to further parallelize using Ray. So I want to create two Ray head nodes that do not communicate with each other, since the folders are independent from each other.

How can I do this?

Presumably something like:

srun [run a python script that starts a Ray head node with no worker nodes since just one node] &
srun [run the same python script that creates a Ray head node that does not talk with the Ray head node from first line, since two folders are independent] &
wait

Stephanie_Wang · January 19, 2023, 3:45pm

Ray nodes are usually meant to be deployed one per VM/physical machine, which is why we strongly recommend against running multiple head nodes (or even worker nodes) on the same machine. For your use case, it should be okay to run both Python scripts on the same Ray head node, since Ray will internally run the Python scripts in different Ray jobs, parallelize them with separate processes, etc.

xzf0kgb0bqr.cev2RWU · January 20, 2023, 3:42am

I’m a bit confused now. If I have two nodes (each with 20 CPUs), would that be one head node (with 20 CPUs) and one worker node (with 20 CPUs), or would that just be one head node with 40 CPUs (that somehow connects the two nodes I have)?

Thanks!

Stephanie_Wang · January 23, 2023, 5:16pm

A Ray node must fit entirely in one machine, so the recommendation would be the former (start a head node on one node and a worker node on the other). Then if you want to make sure that a certain task or group of tasks is only scheduled to one node, you could use something like the NodeAffinitySchedulingStrategy.
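For illustration, here is a minimal sketch of pinning one task per folder to one node with NodeAffinitySchedulingStrategy. It assumes the head and worker nodes were already started with `ray start`. The folder names, `assign_folders_to_nodes`, and `process_folder` are hypothetical placeholders, not anything from your setup.

```python
import socket
from typing import Dict, List


def assign_folders_to_nodes(folders: List[str], node_ids: List[str]) -> Dict[str, str]:
    """Map each folder to a node ID, cycling if there are more folders than nodes."""
    if not node_ids:
        raise ValueError("no alive Ray nodes found")
    return {folder: node_ids[i % len(node_ids)] for i, folder in enumerate(folders)}


def main() -> None:
    # Ray is imported here so the helper above can be used without Ray installed.
    import ray
    from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

    # Assumption: a cluster (one head node, one worker node) is already running.
    ray.init(address="auto")

    @ray.remote
    def process_folder(folder: str) -> str:
        # Placeholder for the real per-folder work.
        return f"{folder} handled on {socket.gethostname()}"

    # Collect the IDs of all nodes that are currently alive.
    node_ids = sorted(n["NodeID"] for n in ray.nodes() if n["Alive"])
    assignment = assign_folders_to_nodes(["folder_a", "folder_b"], node_ids)

    # soft=False means the task fails instead of silently running elsewhere.
    refs = [
        process_folder.options(
            scheduling_strategy=NodeAffinitySchedulingStrategy(node_id=node_id, soft=False)
        ).remote(folder)
        for folder, node_id in assignment.items()
    ]
    print(ray.get(refs))
```

Calling `main()` from the driver script, once both nodes have joined, submits one pinned task per folder. Any tasks that `process_folder` launches itself would need the same scheduling strategy if they should also stay on that node.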

The other way to do it is to start separate Ray “clusters” by launching one head node per physical node. But this is a bit more complicated than what you need because then you’ll need to manage multiple “clusters”.
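If you do go the separate-clusters route, a rough sketch could look like the script below, run once per Slurm node. It assumes that `ray.init(address="local")` is available in your Ray version to start a fresh single-node instance instead of connecting to an existing one. `list_inputs`, `file_size`, and the script layout are hypothetical stand-ins for your actual work.

```python
import os
import sys
from typing import List


def list_inputs(folder: str) -> List[str]:
    """Return the regular files in `folder` as sorted full paths."""
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, name))
    )


def run(folder: str) -> List[int]:
    # Ray is imported here so list_inputs can be used without Ray installed.
    import ray

    # Assumption: "local" forces a new Ray instance on this machine only,
    # so the two srun steps never see each other.
    ray.init(address="local")

    @ray.remote
    def file_size(path: str) -> int:
        # Placeholder for the real per-file work.
        return os.path.getsize(path)

    try:
        return ray.get([file_size.remote(p) for p in list_inputs(folder)])
    finally:
        # Stop this node's Ray instance when the folder is done.
        ray.shutdown()


if __name__ == "__main__" and len(sys.argv) > 1:
    print(run(sys.argv[1]))
```

Each `srun` line would then pass a different folder as the first argument. Since the two instances live on different machines and never share an address, they stay independent, but you have to collect results and handle failures for each one yourself.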
