I am running Ray 1.12.1 on an HPC cluster running RHEL 8.5. I am trying to launch jobs through Slurm with the following adapted script (the long 60-second sleep is there for debugging purposes):
# launch head node, leaving one core unused for the main python script
echo "STARTING HEAD at $head_node"
srun --job-name="ray-head" --unbuffered --nodes=1 --ntasks=1 -w "$head_node" \
    conda-run.sh "${HEAD_CMD}" &

# if we are running on more than one node, start worker nodes
if [[ $SLURM_JOB_NUM_NODES != "1" ]]
then
    sleep 60  # wait for the head node to fully start before launching worker nodes
    worker_num=$((SLURM_JOB_NUM_NODES - 1))  # number of nodes other than the head node
    echo "STARTING ${worker_num} WORKER NODES"
    srun --job-name="ray-workers" --nodes=${worker_num} --ntasks=${worker_num} -w "${worker_nodes}" \
        conda-run.sh "${WORKER_CMD}" &
fi
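
As an aside, the fixed `sleep 60` could be replaced by a readiness check that polls the head node until it accepts TCP connections. The following is only a minimal sketch under some assumptions: `wait_for_port` is a hypothetical helper name that is not part of the script above, port 6379 is Ray's default head port and may differ from whatever `HEAD_CMD` actually uses, and bash must have been built with `/dev/tcp` support.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the original script): wait until host:port
# accepts a TCP connection. Returns 0 on the first successful connection and
# 1 if the timeout (in seconds, default 60) expires first.
wait_for_port() {
    local host="$1" port="$2" timeout_s="${3:-60}"
    local deadline=$((SECONDS + timeout_s))
    while (( SECONDS < deadline )); do
        # bash's /dev/tcp pseudo-device opens a TCP connection. The attempt
        # runs in a child bash under coreutils `timeout`, so a filtered or
        # unreachable host cannot stall a single attempt for long.
        if timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
            return 0
        fi
        sleep 1
    done
    return 1
}

# Possible use in place of `sleep 60` (port 6379 is an assumption):
#   wait_for_port "$head_node" 6379 120 || { echo "head node not reachable" >&2; exit 1; }
```

An open port only shows that the head process is listening, not that it is fully initialized, so a short extra delay may still be needed.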
Whenever I run more than two worker nodes, I get errors like the following:
STARTING 2 WORKER NODES
[2022-06-06 21:36:17,728 I 244176 244176] global_state_accessor.cc:357: This node has an IP address of 172.21.4.81, while we can not found the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.
2022-06-06 21:36:16,209 INFO scripts.py:870 -- Local node IP: 172.21.4.80
2022-06-06 21:36:17,729 SUCC scripts.py:882 -- --------------------
2022-06-06 21:36:17,729 SUCC scripts.py:883 -- Ray runtime started.
2022-06-06 21:36:17,729 SUCC scripts.py:884 -- --------------------
2022-06-06 21:36:17,729 INFO scripts.py:886 -- To terminate the Ray runtime, run
2022-06-06 21:36:17,729 INFO scripts.py:887 -- ray stop
2022-06-06 21:36:17,729 INFO scripts.py:892 -- --block
2022-06-06 21:36:17,729 INFO scripts.py:893 -- This command will now block until terminated by a signal.
2022-06-06 21:36:17,729 INFO scripts.py:896 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly.
2022-06-06 21:36:16,209 INFO scripts.py:870 -- Local node IP: 172.21.4.81
2022-06-06 21:36:17,730 SUCC scripts.py:882 -- --------------------
2022-06-06 21:36:17,730 SUCC scripts.py:883 -- Ray runtime started.
2022-06-06 21:36:17,730 SUCC scripts.py:884 -- --------------------
2022-06-06 21:36:17,730 INFO scripts.py:886 -- To terminate the Ray runtime, run
2022-06-06 21:36:17,730 INFO scripts.py:887 -- ray stop
2022-06-06 21:36:17,730 INFO scripts.py:892 -- --block
2022-06-06 21:36:17,730 INFO scripts.py:893 -- This command will now block until terminated by a signal.
2022-06-06 21:36:17,730 INFO scripts.py:896 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly.
2022-06-06 21:36:18,731 ERR scripts.py:907 -- Some Ray subprcesses exited unexpectedly:
2022-06-06 21:36:18,731 ERR scripts.py:911 -- raylet [exit code=-6]
2022-06-06 21:36:18,731 ERR scripts.py:919 -- Remaining processes will be killed.
Exit with error code 1 (suppressed)