How severely does this issue affect your experience of using Ray?
High: It blocks me from completing my task.
Hi there,
I’m writing to ask how I can connect worker nodes whose IP addresses differ from the head node’s.
I wrote my script following this tutorial: slurm-basic.sh — Ray 1.12.0. I’m using Slurm on Stampede2, a very large HPC cluster, so when I request multiple nodes, their IP addresses differ. Here are the error messages I got (a sketch of my script follows the trace below):
+ head_node=c455-001
+ srun --nodes=1 --ntasks=1 -w c455-001 ray start --head --node-ip-address=xxx.xx.xxx.231 --port=6379 --block
+ worker_num=15
+ (( i = 1 ))
+ (( i <= worker_num ))
+ node_i=c455-014
+ echo 'Starting WORKER 7 at c455-014'
Starting WORKER 7 at c455-014
+ sleep 5
+ srun --nodes=1 --ntasks=1 -w c455-014 ray start --address xxx.xx.xxx.231:6379 --block
[2022-05-09 01:50:04,704 I 95279 95279] global_state_accessor.cc:357: This node has an IP address of xxx.xx.xxx.236, while we can not found the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.
(raylet, ip=xxx.xx.xxx.236) [2022-05-09 01:52:28,892 E 95302 95302] (raylet) worker_pool.cc:518: Some workers of the worker process(95507) have not registered within the timeout. The process is still alive, probably it's hanging during start.
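For reference, the relevant part of my script, following the slurm-basic.sh pattern, looks roughly like this (a reconstructed sketch; the exact values appear in the trace above):
# Sketch of the head/worker startup, following the slurm-basic.sh tutorial.
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

# Start the head node, then give it time to come up.
srun --nodes=1 --ntasks=1 -w "$head_node" \
    ray start --head --node-ip-address="$head_node_ip" --port=6379 --block &
sleep 10

# Start one Ray worker on each remaining node, pointing it at the head node.
worker_num=$((SLURM_JOB_NUM_NODES - 1))
for ((i = 1; i <= worker_num; i++)); do
    node_i=${nodes_array[$i]}
    echo "Starting WORKER $i at $node_i"
    srun --nodes=1 --ntasks=1 -w "$node_i" \
        ray start --address "$head_node_ip:6379" --block &
    sleep 5
done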
Could anyone help me link worker nodes that have different IP addresses to the head node? I wasn’t sure whether the tutorial here: Using Ray on a Large Cluster — Ray 0.01 documentation is the right answer or not. Many thanks in advance!
@rliaw Do you know who the best folks are to answer Ray-on-Slurm questions?
(I did some git-blaming and noticed that you had helped shepherd many of the relevant PRs.)
Thank you all @Ameer_Haj_Ali @Dmitri @rliaw for trying to forward this question to your coworkers to find a solution. I found some people who have had the same issue, but I still couldn’t fix it:
Could you please take a look or forward this post to someone you know? Many thanks!
This looks similar to an issue I had on an HPC configuration where I had multiple networking interfaces. Could you confirm that you have multiple networks (eth0, eth1, etc.)? Depending on your system, you can check this with ip a or ifconfig.
Typically you will have a public network (for the connection to the front-end nodes) and an internal network for communication between the nodes. Usually, the public network only has a few ports open, as opposed to the internal network.
Ray uses the default interface, which could mean the public network, and hence would not work. The current workaround involves connecting to the Redis database after starting Ray on the head node and updating the IP so that the private address is used instead of the public one.
Hi @tupui, thanks for your reply! By running ip a I found 3 networks on my local machine (OS: Ubuntu 18), and I was wondering whether only the third one is responsible for the connection to the HPC cluster.
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.x.x.x/x scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp2s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
link/ether xx:xx:xx:xx:xx:xx xxx ff:ff:ff:ff:ff:ff
3: enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether xx:xx:xx:xx:xx:xx xxx ff:ff:ff:ff:ff:ff
inet 192.xxx.x.x/xx brd 192.xxx.x.xxx scope global dynamic noprefixroute enp0s31f6
valid_lft 167227sec preferred_lft 167227sec
(followed by a few lines of inet6 xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxx:xxxx/64 scope global temporary dynamic / scope global temporary deprecated dynamic )
Oh, are you trying to run Ray on your local machine and use nodes from a cluster? If so, there is an even higher chance of having networking issues. I would instead try to launch Ray while connected to a node (from the pool you requested). If you confirm the networking issue (cf. the linked issue), then the workaround is to connect to the Redis database and change GcsServerAddress to use the internal network.
Hi @tupui, sorry for the confusion – I was trying to run Ray on the HPC cluster, not on my local machine. Because the ip a / ifconfig commands don’t work on the HPC cluster (it’s called Stampede2), I thought you were referring to the network on my local machine.
Could you please tell me how to connect to the Redis database and change GcsServerAddress to use the internal network? Thanks in advance!
It’s not very straightforward, but for completeness it would look like the following. You need a Redis client, e.g. redis-cli. Then you can connect with something like redis-cli -h host -p port -a password. Once connected, you should be able to check the value of the key with GET GcsServerAddress and change its value with SET GcsServerAddress new_address.
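Concretely, a session would look something like this (host, port, password, and new_address are placeholders for your head node’s values):
# Connect to the Redis instance started by the Ray head node.
redis-cli -h host -p port -a password
# Then, at the redis-cli prompt:
GET GcsServerAddress
SET GcsServerAddress new_address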
Thanks for your reply! But my connection with redis-cli wasn’t successful:
redis-cli -h $head_node_ip -p 6379 -a $redis_password
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
Could not connect to Redis at 206.xx.xxx.xxx:6379: Connection refused
not connected>
where I used redis_password=$(uuidgen) to generate $redis_password and commands along the lines of the sketch below.
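(A sketch of a password-protected head-node start, assuming the slurm-basic.sh pattern; my exact commands may have differed slightly:)
# Assumed pattern: generate a password and pass it to ray start on the head node.
redis_password=$(uuidgen)
srun --nodes=1 --ntasks=1 -w "$head_node" \
    ray start --head --node-ip-address="$head_node_ip" --port=6379 \
    --redis-password="$redis_password" --block &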
First, did you confirm that this is really a networking issue? Do you have the same error message as in the issue I posted?
It’s difficult to help you on this. Are you on the head_node? If you are connected to a node and started Ray there, then that node should be the head_node and Redis should be running on it. As far as I know, there is currently no way other than modifying the key manually (there is no fix being worked on in Ray itself yet, that I know of).
I managed to connect with redis-cli via $head_node, thanks for your help. But I got Error: Protocol error, got "\x00" as reply type byte when I entered GET GcsServerAddress. I’m still looking into the reason.
I suspected that could be one of the possible reasons – I finally got ifconfig to work on the HPC cluster I mentioned, and I found 4 interfaces:
eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
eno2: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
Traceback (most recent call last):
File "../../simple-trainer.py", line 28, in <module>
ip_addresses = ray.get([f.remote() for _ in range(num_cpus)])
File "/home1/anaconda3/envs/ray/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/home1/anaconda3/envs/ray/lib/python3.7/site-packages/ray/worker.py", line 1811, in get
raise value
ray.exceptions.LocalRayletDiedError: The task's local raylet died. Check raylet.out for more information.
2022-05-12 14:03:09,977 ERR scripts.py:889 -- Some Ray subprcesses exited unexpectedly:
2022-05-12 14:03:09,977 ERR scripts.py:896 -- raylet [exit code=1]
2022-05-12 14:03:09,978 ERR scripts.py:901 -- Remaining processes will be killed.
(ray) c469-083[knl](1003)$ srun: error: c469-083: task 0: Exited with exit code 1
I was trying to reproduce the error message I posted earlier, but didn’t succeed.
The old error message is (raylet, ip=xxx.xx.xxx.236) [2022-05-09 01:52:28,892 E 95302 95302] (raylet) worker_pool.cc:518: Some workers of the worker process(95507) have not registered within the timeout. The process is still alive, probably it's hanging during start.
Instead, the error message I get now is quite different. It seems this error has something to do with the Pool class from ray.util.multiprocessing.pool.
Traceback (most recent call last):
File "/home1/anaconda3/envs/ray/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/home1/anaconda3/envs/ray/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/home1/anaconda3/envs/ray/lib/python3.7/site-packages/ray/worker.py", line 473, in print_logs
data = subscriber.poll()
File "/home1/anaconda3/envs/ray/lib/python3.7/site-packages/ray/_private/gcs_pubsub.py", line 376, in poll
self._poll_locked(timeout=timeout)
File "/home1/anaconda3/envs/ray/lib/python3.7/site-packages/ray/_private/gcs_pubsub.py", line 266, in _poll_locked
self._poll_request(), timeout=timeout
File "/home1/anaconda3/envs/ray/lib/python3.7/site-packages/grpc/_channel.py", line 976, in future
(operations,), event_handler, self._context)
File "/home1/anaconda3/envs/ray/lib/python3.7/site-packages/grpc/_channel.py", line 1306, in create
_run_channel_spin_thread(state)
File "/home1/anaconda3/envs/ray/lib/python3.7/site-packages/grpc/_channel.py", line 1270, in _run_channel_spin_thread
channel_spin_thread.start()
File "src/python/grpcio/grpc/_cython/_cygrpc/fork_posix.pyx.pxi", line 117, in grpc._cython.cygrpc.ForkManagedThread.start
File "/home1/anaconda3/envs/ray/lib/python3.7/threading.py", line 852, in start
_start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
Traceback (most recent call last):
File "../../hydration_whole_global2_ray.py", line 287, in <module>
with Pool(ray_address="auto") as worker_pool:
File "/home1/anaconda3/envs/ray/lib/python3.7/site-packages/ray/util/multiprocessing/pool.py", line 507, in __init__
self._start_actor_pool(processes)
File "/home1/anaconda3/envs/ray/lib/python3.7/site-packages/ray/util/multiprocessing/pool.py", line 546, in _start_actor_pool
ray.get([actor.ping.remote() for actor, _ in self._actor_pool])
File "/home1/anaconda3/envs/ray/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/home1/anaconda3/envs/ray/lib/python3.7/site-packages/ray/worker.py", line 1811, in get
raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
Here’s a brief summary of my code:
import numpy as np
from functools import partial

import ray
from ray.util.multiprocessing.pool import Pool

def hydration_water_calculation2(t, u):  # analysis of one frame t of universe u
    return xyz  # the computed result for this frame

run_per_frame = partial(hydration_water_calculation2, u=u0)  # u0 is defined earlier
frame_values = np.arange(0, 2501)

with Pool(ray_address="auto") as worker_pool:
    result = worker_pool.map(run_per_frame, frame_values)
I also tried starting Ray by calling ray.init(address=os.environ["ip_head"]) before creating Pool(), or by calling ray.init(address="auto", _redis_password=os.environ["redis_password"]) and register_ray() before creating Pool(), but got the same RuntimeError: can't start new thread.
In this case, do you happen to know any tricks for setting up Pool? Many thanks, and I’ll keep trying to see if I can GET GcsServerAddress.
Yes! That’s of great help. For some reason the clusters I’m using don’t let users take all the available cores of a node or all the available threads of a core, which I didn’t realize when testing my code.
Once I specified smaller values for both --num-cpus in ray start and --ntasks-per-node when requesting nodes, Ray was able to initialize without issues, even though the notification from global_state_accessor is still there.
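For anyone hitting the same thing, the adjustment looks roughly like this (the value 8 is illustrative; use whatever your cluster’s policy allows):
#SBATCH --ntasks-per-node=8    # request fewer tasks than the node has cores

# Correspondingly cap the CPUs Ray claims on each node:
srun --nodes=1 --ntasks=1 -w "$head_node" \
    ray start --head --node-ip-address="$head_node_ip" --port=6379 \
    --num-cpus 8 --block &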
Thanks a lot for your help on this! I really appreciate it.
Hi @tupui!
I’m getting the same error on an HPC SLURM cluster:
global_state_accessor.cc:356: This node has an IP address of 172.16.204.7, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
However, the experiments start running after this error. So can this error be safely ignored?