I'd like to set up a Slurm cluster with a master and n worker nodes, then run an HPO framework that runs the training (done with Ray Train) in spawned processes, each on a separate node. With 2 interactive Slurm nodes, I can run my code using the following workflow:
On the master node:
- Start the Ray server with the --head option
- Start the script as master
On the worker node:
- Join the existing Ray cluster
- Start the script as a worker
How would I represent this in an sbatch script? This is what I have so far as a minimal example:
#SBATCH --nodes=2
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=20
source ~/ray_env/bin/activate
# ray setup from NERSC repository
echo "Ray setup"
redis_password=$(uuidgen)
export redis_password
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST") # Getting the node names
nodes_array=($nodes)
node_1=${nodes_array[0]}
ip=$(srun --nodes=1 --ntasks=1 -w "$node_1" hostname --ip-address) # making redis-address
# if we detect a space character in the head node IP, we'll
# convert it to an ipv4 address. This step is optional.
if [[ "$ip" == *" "* ]]; then
    IFS=' ' read -ra ADDR <<< "$ip"
    if [[ ${#ADDR[0]} -gt 16 ]]; then
        ip=${ADDR[1]}
    else
        ip=${ADDR[0]}
    fi
    echo "IPV6 address detected. We split the IPV4 address as $ip"
fi
port=6379
ip_head=$ip:$port
export ip_head
echo "IP Head: $ip_head"
echo "STARTING HEAD at $node_1"
srun --nodes=1 --ntasks=1 -w "$node_1" \
ray start --head --node-ip-address="$ip" --port=$port --redis-password="$redis_password" --block & ; python3 hp_cluster.py --run_id RayCluster
sleep 30
worker_num=$((SLURM_JOB_NUM_NODES - 1)) #number of nodes other than the head node
for ((i = 1; i <= worker_num; i++)); do
    node_i=${nodes_array[$i]}
    echo "STARTING WORKER $i at $node_i"
    srun --nodes=1 --ntasks=1 -w "$node_i" ray start --address "$ip_head" --redis-password="$redis_password" --block & ; python hp_cluster.py --run_id RayCluster --worker
    sleep 5
done
echo "Setup complete"
However, it doesn't work. The code itself is fine, since I can run it manually from interactive terminals.
Any suggestions on when and how to run my program after starting the Ray cluster would really help.
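One arrangement that mirrors the manual workflow, offered as an untested sketch rather than a verified fix. In bash, `& ;` is a syntax error, because `&` already terminates a command, so the script above cannot get past the head-node line. The sketch below assumes that each node should first start or join Ray and then run hp_cluster.py inside the same srun step, so it wraps both commands in `bash -c` and drops `--block` so that `ray start` returns before the script runs. The names `launch`, `start_head`, `start_worker`, and the `DRY_RUN` switch are hypothetical helpers, not part of Slurm or Ray.

```bash
# Sketch only: one srun step per node that starts/joins Ray, then runs the script.
# DRY_RUN (hypothetical switch, default 1) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

launch() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"   # show the command line for inspection
    else
        "$@" &               # run the srun step in the background
    fi
}

# Assumption: without --block, `ray start` returns once its daemons are up,
# so `&&` then runs the script on the same node, inside the same step.
start_head() {   # args: node ip port password
    launch srun --nodes=1 --ntasks=1 -w "$1" bash -c \
        "ray start --head --node-ip-address=$2 --port=$3 --redis-password=$4 && python3 hp_cluster.py --run_id RayCluster"
}

start_worker() { # args: node ip_head password
    launch srun --nodes=1 --ntasks=1 -w "$1" bash -c \
        "ray start --address $2 --redis-password=$3 && python hp_cluster.py --run_id RayCluster --worker"
}

# Intended use in the sbatch script, after ip, port and redis_password are set:
#   DRY_RUN=0
#   start_head "$node_1" "$ip" "$port" "$redis_password"; sleep 30
#   for node_i in "${nodes_array[@]:1}"; do
#       start_worker "$node_i" "$ip_head" "$redis_password"; sleep 5
#   done
#   wait   # keep the batch job alive until every step has finished
```

With `DRY_RUN` left at 1 the functions only print the srun lines, which makes it possible to check the quoting before submitting. The final `wait` matters because every step runs in the background: without it the batch script would exit and Slurm would end the job.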
Thanks!