Unable to start Model Training with Ray Train on SLURM Cluster

I’d like to set up a Slurm cluster with a master + n worker nodes, then run an HPO framework which runs the training (done with Ray Train) in a spawned process. Each spawned process runs on a separate node. With 2 interactive SLURM nodes, I am able to run my code with the following workflow:

On the Master Node:

  • Start the Ray server with the --head option
  • Start the script as master

On the Worker node:

  • Join the existing Ray cluster
  • Start the script as a worker.
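In an interactive session this corresponds roughly to the following (same flags as in the script below; <head_ip> stands for the head node's address):

# on the master node
ray start --head --port=6379
python3 hp_cluster.py --run_id RayCluster

# on the worker node
ray start --address=<head_ip>:6379
python3 hp_cluster.py --run_id RayCluster --worker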

How would I represent this in the sbatch script? This is what I have so far as a minimal example:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=20

source ~/ray_env/bin/activate

# ray setup from NERSC repository
echo "Ray setup"
redis_password=$(uuidgen)
export redis_password

nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST") # Getting the node names
nodes_array=($nodes)

node_1=${nodes_array[0]}
ip=$(srun --nodes=1 --ntasks=1 -w "$node_1" hostname --ip-address) # making redis-address

# if we detect a space character in the head node IP, we'll
# convert it to an ipv4 address. This step is optional.
if [[ "$ip" == *" "* ]]; then
  IFS=' ' read -ra ADDR <<< "$ip"
  if [[ ${#ADDR[0]} -gt 16 ]]; then
    ip=${ADDR[1]}
  else
    ip=${ADDR[0]}
  fi
  echo "IPV6 address detected. We split the IPV4 address as $ip"
fi

port=6379
ip_head=$ip:$port
export ip_head
echo "IP Head: $ip_head"

echo "STARTING HEAD at $node_1"
srun --nodes=1 --ntasks=1 -w "$node_1" \
  ray start --head --node-ip-address="$ip" --port=$port --redis-password="$redis_password" --block &
sleep 30 # give the head node time to come up
python3 hp_cluster.py --run_id RayCluster & # master script (runs on the batch host, i.e. the head node)

worker_num=$((SLURM_JOB_NUM_NODES - 1)) # number of nodes other than the head node
for ((i = 1; i <= worker_num; i++)); do
  node_i=${nodes_array[$i]}
  echo "STARTING WORKER $i at $node_i"
  srun --nodes=1 --ntasks=1 -w "$node_i" ray start --address "$ip_head" --redis-password="$redis_password" --block &
  python3 hp_cluster.py --run_id RayCluster --worker & # worker script (note: this also runs on the batch host, not on $node_i)
  sleep 5
done


echo "Setup complete"

However, this doesn’t work. The code itself is fine, since the same workflow succeeds when I run it manually in interactive terminals.
Any suggestions on when and how to run my program after starting the Ray cluster would really help.

Thanks!

Since this is a single Ray cluster, maybe you can just submit all the jobs to the head node?

For more details on SLURM, see Deploying on Slurm — Ray 2.8.1
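As an untested sketch, reusing the variables from your script: start the head and every worker with ray start ... --block & as you already do, then launch your driver once, in the foreground, as the last step of the batch script so the allocation stays alive:

# after the head and all workers have joined the cluster
srun --nodes=1 --ntasks=1 -w "$node_1" python3 hp_cluster.py --run_id RayCluster

If hp_cluster.py submits all of its work through Ray, the tasks get scheduled across the whole cluster from there, without a separate --worker launch on every node.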

I’m currently trying that. The HPO framework also has a Master-Worker architecture, so I started the Ray cluster and then submitted the calls to start the master and the worker as separate processes with subprocess.Popen().
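Roughly like this, simplified (the real calls pass a few more arguments):

import subprocess

# the Ray cluster is already up at this point
master = subprocess.Popen(["python3", "hp_cluster.py", "--run_id", "RayCluster"])
worker = subprocess.Popen(["python3", "hp_cluster.py", "--run_id", "RayCluster", "--worker"])

master.wait()
worker.wait()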
Would I face any clashes with this approach?

Thank you :slight_smile: