Ray on SLURM/HPC: starting worker nodes simultaneously

How severely does this issue affect your experience of using Ray?

  • Low: It annoys or frustrates me for a moment.

I am attempting to run Ray on an HPC cluster using SLURM, based on the script and information available here: https://docs.ray.io/en/latest/cluster/slurm.html

One thing that seemed inefficient in the instructions is that worker nodes are launched one by one, with a 5-second sleep after each launch. I tried replacing this with a single srun invocation that launches all worker nodes simultaneously, but that invariably leads to one or more of them crashing on startup (with various error messages).

Does anyone know what causes this behaviour? Is the head node not able to safely 'on-board' multiple worker nodes at the same time? Or do the worker nodes access some non-shareable resource, perhaps?

I don't think there are any known issues here, but do you mind sharing your Ray version, error message, and relevant parts of your config?

I am running Ray 1.12.1 on an HPC cluster running RHEL 8.5. I am attempting to launch jobs using SLURM and the following adapted code (the long 60-second sleep is for debugging purposes):

# launch head node, leaving one core unused for the main python script
echo "STARTING HEAD at $head_node"
srun --job-name="ray-head" --unbuffered --nodes=1 --ntasks=1 -w "$head_node" \
        conda-run.sh "${HEAD_CMD}" &

# if we are running on more than one node, start worker nodes
if [[ $SLURM_JOB_NUM_NODES != "1" ]]
then
  sleep 60  # wait for the head node to fully start before launching worker nodes
  worker_num=$((SLURM_JOB_NUM_NODES - 1))  # number of nodes other than the head node
  echo "STARTING ${worker_num} WORKER NODES"
  srun --job-name="ray-workers" --nodes=${worker_num} --ntasks=${worker_num} -w "${worker_nodes}" \
        conda-run.sh "${WORKER_CMD}" &
fi
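
For completeness: the variables referenced above (head_node, worker_nodes, HEAD_CMD, WORKER_CMD) are set earlier in the script, roughly following the linked Ray/SLURM template. A minimal sketch of that part, where the port number, the worker_nodes construction and the CPU counts are placeholders rather than my exact values:

# derive node roles from the SLURM allocation: first node is the head, the rest are workers
nodes_array=($(scontrol show hostnames "$SLURM_JOB_NODELIST"))
head_node=${nodes_array[0]}
worker_nodes=$(IFS=,; echo "${nodes_array[*]:1}")  # comma-separated hostlist for srun -w
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
port=6379
ip_head="$head_node_ip:$port"

# ray start commands; one CPU on the head node is kept free for the main python script
HEAD_CMD="ray start --head --node-ip-address=$head_node_ip --port=$port --num-cpus=$((SLURM_CPUS_PER_TASK - 1)) --block"
WORKER_CMD="ray start --address=$ip_head --num-cpus=$SLURM_CPUS_PER_TASK --block"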

Whenever I launch two or more worker nodes this way, I get errors like the following:

STARTING 2 WORKER NODES
[2022-06-06 21:36:17,728 I 244176 244176] global_state_accessor.cc:357: This node has an IP address of 172.21.4.81, while we can not found the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.
2022-06-06 21:36:16,209	INFO scripts.py:870 -- Local node IP: 172.21.4.80
2022-06-06 21:36:17,729	SUCC scripts.py:882 -- --------------------
2022-06-06 21:36:17,729	SUCC scripts.py:883 -- Ray runtime started.
2022-06-06 21:36:17,729	SUCC scripts.py:884 -- --------------------
2022-06-06 21:36:17,729	INFO scripts.py:886 -- To terminate the Ray runtime, run
2022-06-06 21:36:17,729	INFO scripts.py:887 --   ray stop
2022-06-06 21:36:17,729	INFO scripts.py:892 -- --block
2022-06-06 21:36:17,729	INFO scripts.py:893 -- This command will now block until terminated by a signal.
2022-06-06 21:36:17,729	INFO scripts.py:896 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly.
2022-06-06 21:36:16,209	INFO scripts.py:870 -- Local node IP: 172.21.4.81
2022-06-06 21:36:17,730	SUCC scripts.py:882 -- --------------------
2022-06-06 21:36:17,730	SUCC scripts.py:883 -- Ray runtime started.
2022-06-06 21:36:17,730	SUCC scripts.py:884 -- --------------------
2022-06-06 21:36:17,730	INFO scripts.py:886 -- To terminate the Ray runtime, run
2022-06-06 21:36:17,730	INFO scripts.py:887 --   ray stop
2022-06-06 21:36:17,730	INFO scripts.py:892 -- --block
2022-06-06 21:36:17,730	INFO scripts.py:893 -- This command will now block until terminated by a signal.
2022-06-06 21:36:17,730	INFO scripts.py:896 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly.
2022-06-06 21:36:18,731	ERR scripts.py:907 -- Some Ray subprcesses exited unexpectedly:
2022-06-06 21:36:18,731	ERR scripts.py:911 -- raylet [exit code=-6]
2022-06-06 21:36:18,731	ERR scripts.py:919 -- Remaining processes will be killed.
Exit with error code 1 (suppressed)

@tupui @Chengeng-Yang have either of you encountered this one?

Can you confirm that it actually works as expected with the 5-second sleep? While reading your messages I was not sure.

Correct. If I replace the parallel srun command above with the following, it works well:

sleep 30
# launch worker nodes
worker_num=$((SLURM_JOB_NUM_NODES - 1)) #number of nodes other than the head node
for ((i = 1; i <= worker_num; i++)); do
  node_i=${nodes_array[$i]}
  echo "STARTING WORKER $i at $node_i"
  srun --job-name="ray-worker" --cpu-bind=none --nodes=1 --ntasks=1 -w "$node_i" \
        conda-run.sh "${WORKER_CMD}" &
  sleep 5
done
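
As an aside, the sleep 30 here (and the sleep 60 in the parallel version) is just a guess at how long the head node needs to come up. If the ray CLI is available in the batch script's environment, that guess could in principle be replaced by polling the head until it answers; a sketch, assuming ip_head holds the head node's address:port:

# poll the head node instead of sleeping for a fixed, guessed duration
until ray status --address="$ip_head" >/dev/null 2>&1; do
  echo "Waiting for the Ray head node at $ip_head ..."
  sleep 2
done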

Could it be some sort of race condition in communication with the head node?

I noticed one case (out of 10+ runs) where this code did lead to a crash, but that seemed to be a fluke where the processes did not actually wait 5 seconds between launches (nor 30 seconds after the head node). Not sure what was going on there…

It looks like a race condition indeed. I would suspect that if one worker takes more time to initialize, another can grab the same port, and hence the two workers conflict.

Out of curiosity, does it really matter for your application that the workers start faster? I would expect SLURM and similar HPC setups to be used for very expensive runs, in which case the startup time would be negligible in comparison.
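
If port contention really is the culprit, one way to test it would be to drop the sleeps but give every worker an explicit, distinct set of ports; if the workers still crash, ports are not the issue. A sketch using the standard ray start port options (the base port values below are arbitrary):

# diagnostic: launch workers back-to-back, but pin distinct ports per worker
for ((i = 1; i <= worker_num; i++)); do
  node_i=${nodes_array[$i]}
  WORKER_CMD="ray start --address=$ip_head --block \
    --node-manager-port=$((6700 + i)) \
    --object-manager-port=$((6800 + i)) \
    --min-worker-port=$((10000 + 1000 * i)) \
    --max-worker-port=$((10999 + 1000 * i))"
  srun --job-name="ray-worker" --nodes=1 --ntasks=1 -w "$node_i" \
        conda-run.sh "${WORKER_CMD}" &
done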

Good to know that this may indeed be the cause (and therefore that sequential launching is a good solution).

Indeed, it's not essential for the workers to start at the same time, but it's good to know that the workers are required to launch sequentially. I will update the launch script and make a note for future enterprising users not to attempt to save a few seconds.

I may, if I find some spare time, try to make an explicit serialization script using flock, so that I don't need to play with guesstimated buffer times.
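
Something along these lines is what I have in mind: a small wrapper run by each worker's srun task that serializes only the ray start step through a lock file on the shared filesystem, then keeps the task alive. This is only a sketch; the lock file location is a placeholder, and note that it drops --block so the lock can be released once the worker has registered:

#!/usr/bin/env bash
# ray-worker-start.sh (sketch): serialize 'ray start' across workers with flock
LOCKFILE="${SLURM_SUBMIT_DIR}/ray-start.lock"

(
  flock -x 200   # only one worker may register with the head at a time
  ray start --address="$ip_head" --num-cpus="$SLURM_CPUS_PER_TASK"   # returns once started
) 200>"$LOCKFILE"

# without --block, the wrapper itself must keep the srun task alive for the job's lifetime
sleep infinity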


Yes I did, on some HPC clusters (SLURM). Sometimes it just suddenly failed after several hours. Error message:

2022-05-27 01:15:15,255	ERR scripts.py:889 -- Some Ray subprcesses exited unexpectedly:
2022-05-27 01:15:15,281	ERR scripts.py:896 -- gcs_server [exit code=-11]
2022-05-27 01:15:15,282	ERR scripts.py:901 -- Remaining processes will be killed.
2022-05-27 01:16:19,106	ERR scripts.py:889 -- Some Ray subprcesses exited unexpectedly:
2022-05-27 01:16:19,110	ERR scripts.py:896 -- raylet [exit code=1]
2022-05-27 01:16:19,111	ERR scripts.py:901 -- Remaining processes will be killed.

Configurations of these clusters are listed below:

Model: Intel Xeon Phi 7250 (Knights Landing)
Total cores per KNL node: 68 cores on a single socket
Hardware threads per core: 4
Hardware threads per node: 68 x 4 = 272
Clock rate: 1.4GHz
RAM: 96GB DDR4 plus 16GB high-speed MCDRAM. Configurable in two important ways; see Programming and Performance: KNL for more info.
Cache: 32KB L1 data cache per core; 1MB L2 per two-core tile. In default config, MCDRAM operates as 16GB direct-mapped L3.
Local storage: All but 504 KNL nodes have a 107GB /tmp partition on a 200GB Solid State Drive (SSD). The 504 KNLs originally installed as the Stampede1 KNL sub-system each have a 32GB /tmp partition on 112GB SSDs. The latter nodes currently make up the development, long and flat-quadrant queues. Size of /tmp partitions as of 24 Apr 2018.

Since my code works on other clusters, I just continue my jobs there. I'm not sure my error message was due to the same cause as in this post, but hopefully it may help address the issue.

Thanks for making this post @simontindemans. I'm asking around to see if we can do anything about this sleep 5 (e.g., solve the underlying issue so workers can start in parallel or without an arbitrary wait).

I'll mark this issue as resolved and will post any information I find.

I've created [Core] [Slurm] Allow parallel startup of Ray workers on Slurm · Issue #25819 · ray-project/ray · GitHub; feel free to add any commentary there. Thanks @simontindemans
