Ray on SLURM/HPC: starting worker nodes simultaneously

How severely does this issue affect your experience of using Ray?

  • Low: It annoys or frustrates me for a moment.

I am attempting to run Ray on an HPC cluster using SLURM, based on the script and information available here: https://docs.ray.io/en/latest/cluster/slurm.html

One thing that seemed inefficient in the instructions is that worker nodes are launched one by one, with a 5-second sleep after each launch. I tried replacing this with a single srun invocation that launches all worker nodes simultaneously, but that invariably leads to one or more of them crashing on startup (with various error messages).

Does anyone know what causes this behaviour? Is the head node not able to safely 'on-board' multiple worker nodes at the same time? Or do the worker nodes access some non-shareable resource, perhaps?

I don't think there are any known issues here, but do you mind sharing your Ray version, error message, and relevant parts of your config?

I am running Ray 1.12.1 on an HPC cluster running RHEL 8.5. I am attempting to launch jobs using SLURM and the following adapted code (the long 60-second sleep is for debugging purposes):

# launch head node, leaving one core unused for the main python script
echo "STARTING HEAD at $head_node"
srun --job-name="ray-head" --unbuffered --nodes=1 --ntasks=1 -w "$head_node" \
        conda-run.sh "${HEAD_CMD}" &

# if we are running on more than one node, start worker nodes
if [[ $SLURM_JOB_NUM_NODES != "1" ]]
then
  sleep 60  # wait for the head node to fully start before launching worker nodes
  worker_num=$((SLURM_JOB_NUM_NODES - 1))  # number of nodes other than the head node
  echo "STARTING ${worker_num} WORKER NODES"
  srun --job-name="ray-workers" --nodes=${worker_num} --ntasks=${worker_num} -w "${worker_nodes}" \
        conda-run.sh "${WORKER_CMD}" &
fi
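
For completeness: the variables referenced above (head_node, worker_nodes, HEAD_CMD, WORKER_CMD) are set earlier in the script, roughly following the linked Ray/SLURM template. A minimal sketch of that part, where the port number, the worker_nodes construction and the CPU counts are placeholders rather than my exact values:

# derive node roles from the SLURM allocation: first node is the head, the rest are workers
nodes_array=($(scontrol show hostnames "$SLURM_JOB_NODELIST"))
head_node=${nodes_array[0]}
worker_nodes=$(IFS=,; echo "${nodes_array[*]:1}")  # comma-separated hostlist for srun -w
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
port=6379
ip_head="$head_node_ip:$port"

# ray start commands; one CPU on the head node is kept free for the main python script
HEAD_CMD="ray start --head --node-ip-address=$head_node_ip --port=$port --num-cpus=$((SLURM_CPUS_PER_TASK - 1)) --block"
WORKER_CMD="ray start --address=$ip_head --num-cpus=$SLURM_CPUS_PER_TASK --block"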

Whenever I launch two or more worker nodes this way, I get errors like the following:

STARTING 2 WORKER NODES
[2022-06-06 21:36:17,728 I 244176 244176] global_state_accessor.cc:357: This node has an IP address of 172.21.4.81, while we can not found the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.
2022-06-06 21:36:16,209	INFO scripts.py:870 -- Local node IP: 172.21.4.80
2022-06-06 21:36:17,729	SUCC scripts.py:882 -- --------------------
2022-06-06 21:36:17,729	SUCC scripts.py:883 -- Ray runtime started.
2022-06-06 21:36:17,729	SUCC scripts.py:884 -- --------------------
2022-06-06 21:36:17,729	INFO scripts.py:886 -- To terminate the Ray runtime, run
2022-06-06 21:36:17,729	INFO scripts.py:887 --   ray stop
2022-06-06 21:36:17,729	INFO scripts.py:892 -- --block
2022-06-06 21:36:17,729	INFO scripts.py:893 -- This command will now block until terminated by a signal.
2022-06-06 21:36:17,729	INFO scripts.py:896 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly.
2022-06-06 21:36:16,209	INFO scripts.py:870 -- Local node IP: 172.21.4.81
2022-06-06 21:36:17,730	SUCC scripts.py:882 -- --------------------
2022-06-06 21:36:17,730	SUCC scripts.py:883 -- Ray runtime started.
2022-06-06 21:36:17,730	SUCC scripts.py:884 -- --------------------
2022-06-06 21:36:17,730	INFO scripts.py:886 -- To terminate the Ray runtime, run
2022-06-06 21:36:17,730	INFO scripts.py:887 --   ray stop
2022-06-06 21:36:17,730	INFO scripts.py:892 -- --block
2022-06-06 21:36:17,730	INFO scripts.py:893 -- This command will now block until terminated by a signal.
2022-06-06 21:36:17,730	INFO scripts.py:896 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly.
2022-06-06 21:36:18,731	ERR scripts.py:907 -- Some Ray subprcesses exited unexpectedly:
2022-06-06 21:36:18,731	ERR scripts.py:911 -- raylet [exit code=-6]
2022-06-06 21:36:18,731	ERR scripts.py:919 -- Remaining processes will be killed.
Exit with error code 1 (suppressed)

@tupui @Chengeng-Yang have either of you encountered this one?

Can you confirm that it actually works as expected with the 5-second sleep? While reading your messages I was not sure.

Correct. If I replace the parallel srun command above with the following, it works well:

sleep 30
# launch worker nodes
worker_num=$((SLURM_JOB_NUM_NODES - 1)) #number of nodes other than the head node
for ((i = 1; i <= worker_num; i++)); do
  node_i=${nodes_array[$i]}
  echo "STARTING WORKER $i at $node_i"
  srun --job-name="ray-worker" --cpu-bind=none --nodes=1 --ntasks=1 -w "$node_i" \
        conda-run.sh "${WORKER_CMD}" &
  sleep 5
done
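
As an aside, the sleep 30 here (and the sleep 60 in the parallel version) is just a guess at how long the head node needs to come up. If the ray CLI is available in the batch script's environment, that guess could in principle be replaced by polling the head until it answers; a sketch, assuming ip_head holds the head node's address:port:

# poll the head node instead of sleeping for a fixed, guessed duration
until ray status --address="$ip_head" >/dev/null 2>&1; do
  echo "Waiting for the Ray head node at $ip_head ..."
  sleep 2
done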

Could it be some sort of race condition in communication with the head node?

I noticed one case (out of 10+ runs) where this code did lead to a crash, but that seemed to be a fluke where the processes did not actually wait 5 seconds between launches (nor 30 seconds after the head node). Not sure what was going on there…

It looks like a race condition indeed. I would suspect that if one worker takes more time to initialize, another can grab the same port, and hence the two workers conflict.

Out of curiosity, does it really matter for your application that the workers start faster? I would expect SLURM and similar HPC setups to be used for very expensive runs, in which case the startup time would be negligible in comparison.
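
If port contention really is the culprit, one way to test it would be to drop the sleeps but give every worker an explicit, distinct set of ports; if the workers still crash, ports are not the issue. A sketch using the standard ray start port options (the base port values below are arbitrary):

# diagnostic: launch workers back-to-back, but pin distinct ports per worker
for ((i = 1; i <= worker_num; i++)); do
  node_i=${nodes_array[$i]}
  WORKER_CMD="ray start --address=$ip_head --block \
    --node-manager-port=$((6700 + i)) \
    --object-manager-port=$((6800 + i)) \
    --min-worker-port=$((10000 + 1000 * i)) \
    --max-worker-port=$((10999 + 1000 * i))"
  srun --job-name="ray-worker" --nodes=1 --ntasks=1 -w "$node_i" \
        conda-run.sh "${WORKER_CMD}" &
done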

Good to know that this may indeed be the cause (and therefore that sequential launching is a good solution).

Indeed, it's not essential for the workers to start at the same time, but it's good to know that the workers are required to launch sequentially. I will update the launch script and make a note for future enterprising users not to attempt to save a few seconds.

I may, if I find some spare time, try to make an explicit serialization script using flock, so that I don't need to play with guesstimated buffer times.
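
Something along these lines is what I have in mind: a small wrapper run by each worker's srun task that serializes only the ray start step through a lock file on the shared filesystem, then keeps the task alive. This is only a sketch; the lock file location is a placeholder, and note that it drops --block so the lock can be released once the worker has registered:

#!/usr/bin/env bash
# ray-worker-start.sh (sketch): serialize 'ray start' across workers with flock
LOCKFILE="${SLURM_SUBMIT_DIR}/ray-start.lock"

(
  flock -x 200   # only one worker may register with the head at a time
  ray start --address="$ip_head" --num-cpus="$SLURM_CPUS_PER_TASK"   # returns once started
) 200>"$LOCKFILE"

# without --block, the wrapper itself must keep the srun task alive for the job's lifetime
sleep infinity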


Yes I did, on some HPC clusters (SLURM). Sometimes it just suddenly failed after several hours. Error message:

2022-05-27 01:15:15,255	ERR scripts.py:889 -- Some Ray subprcesses exited unexpectedly:
2022-05-27 01:15:15,281	ERR scripts.py:896 -- gcs_server [exit code=-11]
2022-05-27 01:15:15,282	ERR scripts.py:901 -- Remaining processes will be killed.
2022-05-27 01:16:19,106	ERR scripts.py:889 -- Some Ray subprcesses exited unexpectedly:
2022-05-27 01:16:19,110	ERR scripts.py:896 -- raylet [exit code=1]
2022-05-27 01:16:19,111	ERR scripts.py:901 -- Remaining processes will be killed.

Configurations of these clusters are listed below:

Model: Intel Xeon Phi 7250 (Knights Landing)
Total cores per KNL node: 68 cores on a single socket
Hardware threads per core: 4
Hardware threads per node: 68 x 4 = 272
Clock rate: 1.4GHz
RAM: 96GB DDR4 plus 16GB high-speed MCDRAM. Configurable in two important ways; see Programming and Performance: KNL for more info.
Cache: 32KB L1 data cache per core; 1MB L2 per two-core tile. In default config, MCDRAM operates as 16GB direct-mapped L3.
Local storage: All but 504 KNL nodes have a 107GB /tmp partition on a 200GB Solid State Drive (SSD). The 504 KNLs originally installed as the Stampede1 KNL sub-system each have a 32GB /tmp partition on 112GB SSDs. The latter nodes currently make up the development, long and flat-quadrant queues. Size of /tmp partitions as of 24 Apr 2018.

Since my code works on other clusters, I just continue my jobs there. I'm not sure my error message was due to the same cause as in this post, but hopefully it may help address the issue.

Thanks for making this post @simontindemans. I'm asking around to see if we can do anything about this sleep 5 (e.g., solve the underlying issue so workers can start in parallel or without an arbitrary wait).

I'll mark this issue as resolved and will post any information I find.

I've created [Core] [Slurm] Allow parallel startup of Ray workers on Slurm · Issue #25819 · ray-project/ray · GitHub; feel free to add any commentary there. Thanks @simontindemans
