Low: It annoys or frustrates me for a moment.
I am using 2 nodes and 2 folders. I want to create 2 processes with Slurm, one to go into each folder. So something like:
srun [run python script using one node] &
srun [run same python script using other node] &
So far so good.
Each process I want to further parallelize using Ray. So I want to create two Ray head nodes that do not communicate with each other, since the folders are independent from each other.
How can I do this?
Presumably something like:
srun [run a python script that starts a Ray head node with no worker nodes since just one node] &
srun [run the same python script that creates a Ray head node that does not talk with the Ray head node from first line, since two folders are independent] &
Ray nodes are usually meant to be deployed one per VM/physical machine, which is why we strongly recommend against running multiple head nodes (or even worker nodes) on the same machine. For your use case, it should be okay to run both Python scripts on the same Ray head node, since Ray will internally run the Python scripts in different Ray jobs, parallelize them with separate processes, etc.
I’m a bit confused now. If I have two nodes (each with 20 CPUs), would that be one head node (with 20 CPUs) and one worker node (with 20 CPUs), or would that just be one head node with 40 CPUs (that somehow connects the two nodes I have)?
A Ray node must fit entirely in one machine, so the recommendation would be the former (start a head node on one node and a worker node on the other). Then if you want to make sure that a certain task or group of tasks is only scheduled to one node, you could use something like the NodeAffinitySchedulingStrategy.
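As a rough sketch of the node-affinity approach (not tested on a Slurm cluster): the folder names, the `process_folder` task, and the `alive_node_ids` helper below are hypothetical placeholders, and the code assumes a recent Ray 2.x cluster is already running with one head node and one worker node.

```python
def alive_node_ids(nodes):
    """Return the NodeID of every live entry in a ray.nodes()-style list."""
    return [n["NodeID"] for n in nodes if n.get("Alive")]


def run_one_folder_per_node(folders):
    """Pin the work for each folder to its own Ray node (hypothetical driver)."""
    import ray
    from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

    # Connect to the cluster that was started beforehand with `ray start`.
    ray.init(address="auto")

    @ray.remote
    def process_folder(folder):
        # Placeholder work: report which node actually ran this folder.
        return folder, ray.get_runtime_context().get_node_id()

    node_ids = alive_node_ids(ray.nodes())
    refs = [
        process_folder.options(
            # soft=False: fail rather than fall back to a different node.
            scheduling_strategy=NodeAffinitySchedulingStrategy(
                node_id=node_id, soft=False
            )
        ).remote(folder)
        for folder, node_id in zip(folders, node_ids)
    ]
    return ray.get(refs)
```

Calling `run_one_folder_per_node(["folder_a", "folder_b"])` from a driver on the cluster should return each folder paired with the ID of the node it ran on. Note that any tasks launched from inside `process_folder` would need their own scheduling strategy if they also have to stay on that node.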
The other way to do it is to start separate Ray “clusters” by launching one head node per physical node. But this is a bit more complicated than what you need because then you’ll need to manage multiple “clusters”.
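A minimal sketch of that second option, assuming each `srun` launches this script on its own physical node with a different folder argument. The `list_inputs` helper, the `handle` task, and the default of 20 CPUs are illustrative assumptions; `address="local"` is used on the understanding that recent Ray 2.x releases treat it as a request to start a fresh local instance instead of joining an existing one.

```python
import os


def list_inputs(folder):
    """Return the sorted paths of the regular files directly inside folder."""
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, name))
    )


def process_folder(folder, num_cpus=20):
    """Start a private single-node Ray instance and process one folder with it."""
    import ray

    # Assumption: address="local" starts a new local instance on this machine,
    # so the two srun processes never join the same cluster.
    ray.init(address="local", num_cpus=num_cpus)

    @ray.remote
    def handle(path):
        # Placeholder work: replace with the real per-file computation.
        return path, os.path.getsize(path)

    try:
        return ray.get([handle.remote(p) for p in list_inputs(folder)])
    finally:
        ray.shutdown()
```

Each Slurm step would then call `process_folder` with its own folder. Because the two instances never share a cluster, nothing needs to be coordinated between them, but each one has to be monitored and shut down separately.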