Unable to start Model Training with Ray Train on SLURM Cluster

I’d like to set up a Slurm cluster with one master and n worker nodes, then run an HPO framework that launches the training (done with Ray Train) in spawned processes, each on a separate node. With two interactive Slurm nodes, I can run my code using the following workflow:

On the Master Node:

  • Start the ray server with option --head
  • Start the script as master

On the Worker Node:

  • Join the existing ray cluster
  • Start the script as a worker
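
Spelled out as commands, the manual steps amount to roughly the sketch below. The `start_node` helper is hypothetical and only there to group the two roles, and port 6379 is an assumption (it is Ray's default head port); the `hp_cluster.py` invocations are the same ones used in the batch script further down.

```shell
#!/bin/bash
port=6379  # assumption: Ray's default head port

# Hypothetical helper: $1 is "head" or "worker", $2 is the head node IP (worker only).
start_node() {
  case "$1" in
    head)
      # Start the Ray head, then run the script as master.
      ray start --head --port="$port"
      python3 hp_cluster.py --run_id RayCluster
      ;;
    worker)
      # Join the existing Ray cluster, then run the script as a worker.
      ray start --address="$2:$port"
      python3 hp_cluster.py --run_id RayCluster --worker
      ;;
    *)
      echo "usage: start_node head | start_node worker <head-ip>" >&2
      return 1
      ;;
  esac
}

# Only act when a role was passed on the command line.
if [ "$#" -gt 0 ]; then
  start_node "$@"
fi
```

On the master node this would be called as `start_node head`, and on each worker node as `start_node worker <head-ip>`.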

How would I represent this in the sbatch script? This is what I have so far as a minimal example:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=20

source ~/ray_env/bin/activate

# ray setup from NERSC repository
echo "Ray setup"
redis_password=$(uuidgen)
export redis_password

nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST") # Getting the node names
node_1=$(head -n 1 <<< "$nodes") # the first node hosts the Ray head

ip=$(srun --nodes=1 --ntasks=1 -w "$node_1" hostname --ip-address) # making redis-address

# if we detect a space character in the head node IP, we'll
# convert it to an ipv4 address. This step is optional.
if [[ "$ip" == *" "* ]]; then
  IFS=' ' read -ra ADDR <<< "$ip"
  if [[ ${#ADDR[0]} -gt 16 ]]; then
    ip=${ADDR[1]}
  else
    ip=${ADDR[0]}
  fi
  echo "IPV6 address detected. We split the IPV4 address as $ip"
fi

port=6379
ip_head=$ip:$port
export ip_head
echo "IP Head: $ip_head"

echo "STARTING HEAD at $node_1"
srun --nodes=1 --ntasks=1 -w "$node_1" \
  ray start --head --node-ip-address="$ip" --port=$port --redis-password="$redis_password" --block & ; python3 hp_cluster.py --run_id RayCluster
sleep 30

worker_num=$((SLURM_JOB_NUM_NODES - 1)) #number of nodes other than the head node
for ((i = 1; i <= worker_num; i++)); do
  node_i=$(sed -n "$((i + 1))p" <<< "$nodes") # hostname of the i-th worker node
  echo "STARTING WORKER $i at $node_i"
  srun --nodes=1 --ntasks=1 -w "$node_i" ray start --address "$ip_head" --redis-password="$redis_password" --block & ; python hp_cluster.py --run_id RayCluster --worker
  sleep 5
done

echo "Setup complete"

However, this doesn’t work. The code itself is fine, since I can run it manually from interactive terminals.
Any suggestions on when and how to run my program after starting the Ray cluster would really help.


Since this is a single Ray cluster, maybe you can just submit all the jobs to the head node?

For more details on SLURM, see Deploying on Slurm — Ray 2.8.1
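
As a rough, untested sketch of that idea: start the head and the workers in the background with `srun ... &`, then run the driver once in the foreground from the batch script. Note that `& ;` in the script above is a bash syntax error, because `&` already terminates the command. The `start_head`, `start_workers`, and `main` function names and port 6379 are assumptions for illustration, and the Redis password is left out for brevity.

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=20

port=6379  # assumption: Ray's default head port

# Start the Ray head on node $1 (IP $2) and leave it running in the background.
start_head() {
  srun --nodes=1 --ntasks=1 -w "$1" \
    ray start --head --node-ip-address="$2" --port="$port" --block &
}

# Join each hostname in $1 (whitespace-separated) to the cluster at $2 (ip:port).
start_workers() {
  for node in $1; do
    srun --nodes=1 --ntasks=1 -w "$node" ray start --address="$2" --block &
    sleep 5
  done
}

main() {
  nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
  head_node=$(echo "$nodes" | head -n 1)
  worker_nodes=$(echo "$nodes" | tail -n +2)
  # Assumes hostname prints a single IPv4 address; otherwise keep the
  # IPv6 handling from the script in the question.
  ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

  start_head "$head_node" "$ip"
  sleep 30
  start_workers "$worker_nodes" "$ip:$port"

  # Run the driver once, in the foreground, after the cluster is up.
  python3 hp_cluster.py --run_id RayCluster
}

# Only launch when running inside a Slurm allocation.
if [ -n "${SLURM_JOB_ID:-}" ]; then
  main
fi
```

Here the Python script runs only once, on the node that executes the batch script, and relies on Ray to place work on the other nodes.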

I’m currently trying that. The HPO framework also has a master-worker architecture, so I started the Ray cluster and then issued the calls to start the master and the worker as separate processes with subprocess.Popen().
Would I face any clashes with this approach?

Thank you 🙂