Ray on slurm - different ip addresses of worker nodes

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

Hi there,

I’m writing this post to ask how I can connect worker nodes whose IP addresses differ from the head node’s.

I wrote my script following this tutorial: slurm-basic.sh — Ray 1.12.0. I’m using Slurm on Stampede2, a really large HPC cluster, so when I request multiple nodes, they end up with different IP addresses. Here are the error messages I got:


+ head_node=c455-001
+ srun --nodes=1 --ntasks=1 -w c455-001 ray start --head --node-ip-address=xxx.xx.xxx.231 --port=6379 --block
+ worker_num=15
+ (( i = 1 ))
+ (( i <= worker_num ))

+ node_i=c455-014
+ echo 'Starting WORKER 7 at c455-014'
Starting WORKER 7 at c455-014
+ sleep 5
+ srun --nodes=1 --ntasks=1 -w c455-014 ray start --address xxx.xx.xxx.231:6379 --block
[2022-05-09 01:50:04,704 I 95279 95279] global_state_accessor.cc:357: This node has an IP address of xxx.xx.xxx.236, while we can not found the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.

(raylet, ip=xxx.xx.xxx.236) [2022-05-09 01:52:28,892 E 95302 95302] (raylet) worker_pool.cc:518: Some workers of the worker process(95507) have not registered within the timeout. The process is still alive, probably it's hanging during start.


Could anyone help me connect worker nodes that have different IP addresses to the head node? I wasn’t sure whether the tutorial here: Using Ray on a Large Cluster — Ray 0.01 documentation is the right answer. Many thanks in advance!

@Dmitri + @Alex can you please help answer this question?

@rliaw Do you know who the best folks are to answer Ray-on-Slurm questions?
(I did some git-blaming and noticed that you had helped shepherd many of the relevant PRs.)

Thank you all @Ameer_Haj_Ali @Dmitri @rliaw for forwarding this question to your coworkers to find a solution. I actually found some people having the same issue, but I still couldn’t fix it:

Could you please take a look or forward this post to someone you know? Many thanks!

Hi @Chengeng-Yang, I suspect this has something to do with the following issue: [Feature] [core] Selecting network interface · Issue #22732 · ray-project/ray · GitHub

This looks similar to an issue I had on an HPC configuration where I had multiple networking interfaces. Could you confirm that you have multiple networks (eth0, eth1, etc.)? Depending on your system, you can check this with ip a or ifconfig.

Typically you will have a public network (for the connection to the front-end nodes) and an internal network for communication between the nodes. Usually, the public network only has a few ports open, as opposed to the internal network.

Ray uses the default interface, which could mean the public network; hence it would not work. The current workaround involves connecting to the Redis database after starting Ray on the main node and updating the IP so that the private address is used instead of the public one.
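To illustrate why the default interface matters: node IP auto-detection is commonly done with a UDP-socket trick along the lines of the sketch below. This is a simplified illustration, not Ray’s exact code; the probe address is arbitrary, and no packets are actually sent. The kernel simply reports which local address (i.e. which interface, typically the default route’s) it would use to reach the probe, which on an HPC system can be the public network rather than the internal one:

```python
import socket

def default_route_ip(probe="8.8.8.8:53"):
    # "Connecting" a UDP socket sends no packets; it only asks the kernel
    # which local address would be used to reach the probe address.
    host, port = probe.split(":")
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect((host, int(port)))
        return s.getsockname()[0]  # the IP of the interface that would be used
    finally:
        s.close()
```

If that IP belongs to the public interface, any component that advertises it to the other nodes will be unreachable over the internal network.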

Hi @tupui , thanks for your reply! Running ip a shows 3 networks on my local machine (OS: Ubuntu 18), and I was wondering whether only the third one handles the connection to the HPC.

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.x.x.x/x scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enp2s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether xx:xx:xx:xx:xx:xx xxx ff:ff:ff:ff:ff:ff
3: enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether xx:xx:xx:xx:xx:xx xxx ff:ff:ff:ff:ff:ff
    inet 192.xxx.x.x/xx brd 192.xxx.x.xxx scope global dynamic noprefixroute enp0s31f6
       valid_lft 167227sec preferred_lft 167227sec
    (followed by a few lines of inet6 xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxx:xxxx/64 scope global temporary dynamic / scope global temporary deprecated dynamic )

Oh, are you trying to run Ray on your local machine and use nodes from a cluster? If so, there is an even higher chance of networking issues. I would instead try to launch Ray while connected to a node (from the pool you requested). If you confirm the networking issue (cf. the linked issue), then the workaround is to connect to the Redis database and change GcsServerAddress to use the internal network.

Hi @tupui , sorry for the confusion – I’m trying to run Ray on the HPC cluster (Stampede2), not on my local machine. Because the ip a / ifconfig commands don’t work on that HPC, I thought you were referring to the network on my local machine.
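(Side note: even when ip a or ifconfig is unavailable, Python’s standard library can at least list the interface names on Linux, e.g.:)

```python
import socket

# Enumerate (index, name) pairs for the machine's network interfaces,
# without needing ip/ifconfig. Linux-only in the standard library.
for index, name in socket.if_nameindex():
    print(index, name)
```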

Could you please tell me how to connect to the Redis database and change GcsServerAddress to use the internal network? Thanks in advance!

It’s not very straightforward, but for completeness it would look like the following. You need a Redis client, e.g. redis-cli. Then you could do something like redis-cli -h host -p port -a password, after which you should be able to check the value of the key with GET GcsServerAddress and change it with SET GcsServerAddress new_address.

Thanks for your reply! But my connection to redis-cli wasn’t successful.

redis-cli -h $head_node_ip -p 6379 -a $redis_password
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
Could not connect to Redis at 206.xx.xxx.xxx:6379: Connection refused
not connected>

where I used redis_password=$(uuidgen) to get $redis_password and the following commands

nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

to get the head node’s IP address, $head_node_ip.
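(For completeness, the slurm template also guards against hostname --ip-address returning both an IPv6 and an IPv4 address separated by a space. A sketch of that selection logic, with made-up addresses:)

```shell
# Sketch of the address-selection logic in slurm-basic.sh: if
# `hostname --ip-address` returns "<ipv6> <ipv4>", keep the IPv4 one.
# The addresses below are made up for illustration.
head_node_ip="fe80::aabb:ccdd:eeff:0011 192.168.1.10"
if [[ "$head_node_ip" == *" "* ]]; then
  IFS=' ' read -ra ADDR <<< "$head_node_ip"
  if [[ ${#ADDR[0]} -gt 16 ]]; then
    head_node_ip=${ADDR[1]}   # first token longer than 16 chars -> it is IPv6
  else
    head_node_ip=${ADDR[0]}
  fi
fi
echo "$head_node_ip"
```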

In this case, is there any other way to GET GcsServerAddress, and if so, what new_address should be put in SET GcsServerAddress new_address? Many thanks!

First, did you confirm that this is really a networking issue? Do you have the same error message as in the issue I posted?

It’s difficult to help you on this. Are you on the head_node? If you are connected to a node and started Ray there, then it should be the head_node and Redis should be running there. As far as I know, there is no other way at the moment than modifying the key manually (no fix, that I know of, is being worked on in Ray itself yet).

Thanks for your response! I really appreciate your help so far.


Yes, I actually requested multiple nodes in an interactive session. I had access to all the nodes and defined the first node as my head_node, as the template does (slurm-basic.sh — Ray 1.12.0).

I managed to connect to redis-cli via $head_node, thanks for your help. But I got Error: Protocol error, got "\x00" as reply type byte when I entered GET GcsServerAddress. I’m still looking into the reason.


I suspected that could be one of the possible reasons – I finally got ifconfig working on the HPC I mentioned, and I found 4 networks:

eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
eno2: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2044
lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536

But I didn’t get the same error message as in the issue you posted, so I’m not 100% sure about it. (I’ve tested this code ray/simple-trainer.py at master · ray-project/ray · GitHub and it worked fine, so I think Ray started properly in this case.) (Update 05/12/2022: it only works when num_cpus < 176. Error messages as follows.)

Traceback (most recent call last):
  File "../../simple-trainer.py", line 28, in <module>
    ip_addresses = ray.get([f.remote() for _ in range(num_cpus)])
  File "/home1/anaconda3/envs/ray/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home1/anaconda3/envs/ray/lib/python3.7/site-packages/ray/worker.py", line 1811, in get
    raise value
ray.exceptions.LocalRayletDiedError: The task's local raylet died. Check raylet.out for more information.
2022-05-12 14:03:09,977	ERR scripts.py:889 -- Some Ray subprcesses exited unexpectedly:
2022-05-12 14:03:09,977	ERR scripts.py:896 -- raylet [exit code=1]
2022-05-12 14:03:09,978	ERR scripts.py:901 -- Remaining processes will be killed.
(ray) c469-083[knl](1003)$ srun: error: c469-083: task 0: Exited with exit code 1

I was trying to reproduce the error message I posted earlier, but didn’t succeed.
The old error message is (raylet, ip=xxx.xx.xxx.236) [2022-05-09 01:52:28,892 E 95302 95302] (raylet) worker_pool.cc:518: Some workers of the worker process(95507) have not registered within the timeout. The process is still alive, probably it's hanging during start.

Instead, the error message I get now is quite different. It seems the error has something to do with the Pool function from ray.util.multiprocessing.pool.

Traceback (most recent call last):
  File "/home1/anaconda3/envs/ray/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/home1/anaconda3/envs/ray/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home1/anaconda3/envs/ray/lib/python3.7/site-packages/ray/worker.py", line 473, in print_logs
    data = subscriber.poll()
  File "/home1/anaconda3/envs/ray/lib/python3.7/site-packages/ray/_private/gcs_pubsub.py", line 376, in poll
    self._poll_locked(timeout=timeout)
  File "/home1/anaconda3/envs/ray/lib/python3.7/site-packages/ray/_private/gcs_pubsub.py", line 266, in _poll_locked
    self._poll_request(), timeout=timeout
  File "/home1/anaconda3/envs/ray/lib/python3.7/site-packages/grpc/_channel.py", line 976, in future
    (operations,), event_handler, self._context)
  File "/home1/anaconda3/envs/ray/lib/python3.7/site-packages/grpc/_channel.py", line 1306, in create
    _run_channel_spin_thread(state)
  File "/home1/anaconda3/envs/ray/lib/python3.7/site-packages/grpc/_channel.py", line 1270, in _run_channel_spin_thread
    channel_spin_thread.start()
  File "src/python/grpcio/grpc/_cython/_cygrpc/fork_posix.pyx.pxi", line 117, in grpc._cython.cygrpc.ForkManagedThread.start
  File "/home1/anaconda3/envs/ray/lib/python3.7/threading.py", line 852, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread

Traceback (most recent call last):
  File "../../hydration_whole_global2_ray.py", line 287, in <module>
    with Pool(ray_address="auto") as worker_pool:
  File "/home1/anaconda3/envs/ray/lib/python3.7/site-packages/ray/util/multiprocessing/pool.py", line 507, in __init__
    self._start_actor_pool(processes)
  File "/home1/anaconda3/envs/ray/lib/python3.7/site-packages/ray/util/multiprocessing/pool.py", line 546, in _start_actor_pool
    ray.get([actor.ping.remote() for actor, _ in self._actor_pool])
  File "/home1/anaconda3/envs/ray/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home1/anaconda3/envs/ray/lib/python3.7/site-packages/ray/worker.py", line 1811, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.

Here’s a brief summary of my code:

import numpy as np
from functools import partial

import ray
from ray.util.multiprocessing.pool import Pool

def hydration_water_calculation2(t, u):  # in one frame
    xyz = ...  # actual per-frame calculation on universe u omitted here
    return xyz

run_per_frame = partial(hydration_water_calculation2, u=u0)  # u0 defined earlier
frame_values = np.arange(0, 2501)
with Pool(ray_address="auto") as worker_pool:
    result = worker_pool.map(run_per_frame, frame_values)

I also tried starting Ray by calling ray.init(address=os.environ["ip_head"]) before creating Pool(), or calling ray.init(address="auto", _redis_password=os.environ["redis_password"]) and register_ray() before creating Pool(), but got the same RuntimeError: can't start new thread.

In this case, do you happen to know any tricks for setting up Pool? Many thanks, and I’ll keep trying to see if I can GET GcsServerAddress. :grinning:

You might want to have a look at this thread: python - error: can't start new thread - Stack Overflow

It seems like your problem could simply be an oversubscription issue, either from memory or from a platform limitation.
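Since ray.util.multiprocessing.Pool mirrors the stdlib multiprocessing API, one way to avoid oversubscription is to cap the worker count explicitly instead of letting the pool claim every visible core. A minimal sketch with the stdlib Pool (all numbers are illustrative; with Ray you would pass processes= alongside ray_address="auto"):

```python
import multiprocessing as mp

def square(x):
    return x * x

# Use the fork context explicitly (Linux) so this snippet also works when
# the interpreter's default start method is spawn/forkserver.
ctx = mp.get_context("fork")

# Cap the pool at 4 workers rather than one per available core/thread,
# so we never claim more than the scheduler actually granted us.
with ctx.Pool(processes=4) as pool:
    result = pool.map(square, range(10))
print(result)
```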

Yes! That’s of great help. For some reason the clusters I’m using don’t let users take either all the available cores of a node or all the available threads of a core, which I didn’t realize when testing my code.

Once I specified smaller values both for --num-cpus in ray start and for --ntasks-per-node when requesting nodes, Ray was able to initialize without issues, even though the notification from global_state_accessor is still there.
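For anyone landing here later, the relevant pieces of the job script end up looking something like this (a sketch based on the slurm-basic.sh template; all numbers are illustrative, not Stampede2-specific):

```shell
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32   # request fewer than the node's full core/thread count

# Start the head node with an explicit, reduced CPU count, leaving headroom
# instead of claiming every hardware thread on the node.
srun --nodes=1 --ntasks=1 -w "$head_node" \
    ray start --head --node-ip-address="$head_node_ip" --port=6379 \
    --num-cpus 32 --block &
```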

Thanks a lot for your help on this! I really appreciate it :grinning: