Ray on slurm - different ip addresses of worker nodes

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

Hi there,

I’m writing a post to ask how I would be able to connect worker nodes with ip addresses different from the head node.

I wrote my script following this tutorial: . slurm-basic.sh — Ray 1.12.0. I’m using slurm on stampede 2, a really large hpc cluster. Therefore, when I request multiple nodes, their ip addresses seem different. Here’re the error messages I got:

+ head_node=c455-001
+ srun --nodes=1 --ntasks=1 -w c455-001 ray start --head --node-ip-address=xxx.xx.xxx.231 --port=6379 --block
+ worker_num=15
+ (( i = 1 ))
+ (( i <= worker_num ))

+ node_i=c455-014
+ echo 'Starting WORKER 7 at c455-014'
Starting WORKER 7 at c455-014
+ sleep 5
+ srun --nodes=1 --ntasks=1 -w c455-014 ray start --address xxx.xx.xxx.231:6379 --block
[2022-05-09 01:50:04,704 I 95279 95279] global_state_accessor.cc:357: This node has an IP address of xxx.xx.xxx.236, while we can not found the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.

(raylet, ip=xxx.xx.xxx.236) [2022-05-09 01:52:28,892 E 95302 95302] (raylet) worker_pool.cc:518: Some workers of the worker process(95507) have not registered within the timeout. The process is still alive, probably it's hanging during start.

Could anyone help me with linking worker nodes with different ip addresses to the head node? I wasn’t sure if the tutorial here: Using Ray on a Large Cluster — Ray 0.01 documentation is the right answer or not. Many thanks in advance!

@Dmitri + @Alex can you please help answer this question?

@rliaw Do you know who the best folks are to answer Ray-on-Slurm questions?
(I did some git-blaming and noticed that you had helped shepherd many of the relevant PRs.)

Thank you all @Ameer_Haj_Ali @Dmitri @rliaw for trying to forward this question to your coworkers to find out the solutions. Actually, I found some people having the same issue, but I still couldn’t fix it:

Could you please take a look or forward this post to someone you know? Many thanks!

Hi @Chengeng-Yang, I suspect this has something to do with the following issue: [Feature] [core] Selecting network interface · Issue #22732 · ray-project/ray · GitHub

This looks similar to the issue I had on a HPC configuration were I had multiple networking interfaces. Could you confirm that you have a multiple networks (eth0, eth1, etc.)? Depending on your system you can access this with ip a or ifconfig.

Typically you will have a public network (for the connection to the frontal nodes) and an internal network for communication between the nodes. Usually, the public network only has a few ports open as opposed to the internal network.

Ray is using the default interface, which could mean the public network. Hence it would not work. The current work around involves connecting to the redis database after starting Ray on the main node to update the IP to use the private address instead of the public one.

Hi @tupui , thanks for your reply! By running ip a I found 3 networks on my local machine (OS: Ubuntu 18), and I was wondering if only the third network is responsible for connection to HPC.

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.x.x.x/x scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enp2s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether xx:xx:xx:xx:xx:xx xxx ff:ff:ff:ff:ff:ff
3: enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether xx:xx:xx:xx:xx:xx xxx ff:ff:ff:ff:ff:ff
    inet 192.xxx.x.x/xx brd 192.xxx.x.xxx scope global dynamic noprefixroute enp0s31f6
       valid_lft 167227sec preferred_lft 167227sec
    (followed by a few lines of inet6 xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxx:xxxx/64 scope global temporary dynamic / scope global temporary deprecated dynamic )

Oh, are you trying to run Ray on your local machine and use nodes from a cluster? If so there is an even higher chance of having networking issues. I would instead try to launch Ray when connected to a node (from the pool you requested). If you confirm the networking issue (cf the linked issue). Then the workaround is to connect to the redis database and change GcsServerAddress to the use the internal network.

Hi @tupui , sorry for the confusion – I was trying to run Ray on the HPC cluster instead of my local machine. Because the ip a / ifconfig command doesn’t work on HPC (it’s called stampede 2), I was thinking you referred to the network on my local machine.

Could you please tell me how to connect to the redis database and change GcsServerAddress to the use the internal network? Thanks in advance!

It’s not very straightforward. But for completeness it would look like the following. You need a Redis client, e.g. redis-cli. Then you could do something like redis-cli -h host -p port -a password. Then you should be able to check the value of the key with GET GcsServerAddress and change its value with SET GcsServerAddress new_address.

Thanks for your reply! But my connection to redis-cli wasn’t successful.

redis-cli -h $head_node_ip -p 6379 -a $redis_password
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
Could not connect to Redis at 206.xx.xxx.xxx:6379: Connection refused
not connected>

where I used redis_password=$(uuidgen) to get $redis_password and the following commands

nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

to get my hostname, $head_node_ip.

In this case, is any other way to GET GcsServerAddress and if so, what new_address should be put in SET GcsServerAddress new_address? Many thanks!

First, did you confirm that this is really a networking issue? Do you have the same error message as in the issue I posted?

It’s difficult to help you on this. Are you on the head_node? If you are connected to a node, started ray, then it should be the head_node and Redis should be running there. As far as I know, there is no other way at the moment than modifying the key manually (there is no fix yet, that I know of, being worked out in Ray itself).

Thanks for your response! I really appreciate your help so far.

Yes, I actually requested multiple nodes in an interact session. I had access to all the nodes and defined the first node as my head_node like the template does (slurm-basic.sh — Ray 1.12.0).

I made it to connect to redis-cli via $head_node, thanks for your help. But I got Error: Protocol error, got "\x00" as reply type byte when I entered GET GcsServerAddress. I’m still looking up the reason.

I was suspecting that could be one of the possible reasons – I finally made ifconfig work on the hpc that I was talking about, and I found 4 networks:

eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
eno2: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2044
lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536

But I didn’t have the same error message as in the issue you posted. So I’m not 100% sure about it. (I’ve tested this code ray/simple-trainer.py at master · ray-project/ray · GitHub and it worked fine, so I think Ray started properly in this case.) (updates 05/12/2022: it only works when num_cpus<176. Error messages as follows)

Traceback (most recent call last):
  File "../../simple-trainer.py", line 28, in <module>
    ip_addresses = ray.get([f.remote() for _ in range(num_cpus)])
  File "/home1/anaconda3/envs/ray/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home1/anaconda3/envs/ray/lib/python3.7/site-packages/ray/worker.py", line 1811, in get
    raise value
ray.exceptions.LocalRayletDiedError: The task's local raylet died. Check raylet.out for more information.
2022-05-12 14:03:09,977	ERR scripts.py:889 -- Some Ray subprcesses exited unexpectedly:
2022-05-12 14:03:09,977	ERR scripts.py:896 -- raylet [exit code=1]
2022-05-12 14:03:09,978	ERR scripts.py:901 -- Remaining processes will be killed.
(ray) c469-083[knl](1003)$ srun: error: c469-083: task 0: Exited with exit code 1

I was trying to reproduce the error message I posted earlier, but didn’t succeed.
The old error message is (raylet, ip=xxx.xx.xxx.236) [2022-05-09 01:52:28,892 E 95302 95302] (raylet) worker_pool.cc:518: Some workers of the worker process(95507) have not registered within the timeout. The process is still alive, probably it's hanging during start.

Instead, the error message I get now is quite different. It seems like such error is something to do with the Pool function from ray.util.multiprocessing.pool.

Traceback (most recent call last):
  File "/home1/anaconda3/envs/ray/lib/python3.7/threading.py", line 926, in _bootstrap_inner
  File "/home1/anaconda3/envs/ray/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home1/anaconda3/envs/ray/lib/python3.7/site-packages/ray/worker.py", line 473, in print_logs
    data = subscriber.poll()
  File "/home1/anaconda3/envs/ray/lib/python3.7/site-packages/ray/_private/gcs_pubsub.py", line 376, in poll
  File "/home1/anaconda3/envs/ray/lib/python3.7/site-packages/ray/_private/gcs_pubsub.py", line 266, in _poll_locked
    self._poll_request(), timeout=timeout
  File "/home1/anaconda3/envs/ray/lib/python3.7/site-packages/grpc/_channel.py", line 976, in future
    (operations,), event_handler, self._context)
  File "/home1/anaconda3/envs/ray/lib/python3.7/site-packages/grpc/_channel.py", line 1306, in create
  File "/home1/anaconda3/envs/ray/lib/python3.7/site-packages/grpc/_channel.py", line 1270, in _run_channel_spin_thread
  File "src/python/grpcio/grpc/_cython/_cygrpc/fork_posix.pyx.pxi", line 117, in grpc._cython.cygrpc.ForkManagedThread.start
  File "/home1/anaconda3/envs/ray/lib/python3.7/threading.py", line 852, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread

Traceback (most recent call last):
  File "../../hydration_whole_global2_ray.py", line 287, in <module>
    with Pool(ray_address="auto") as worker_pool:
  File "/home1/anaconda3/envs/ray/lib/python3.7/site-packages/ray/util/multiprocessing/pool.py", line 507, in __init__
  File "/home1/anaconda3/envs/ray/lib/python3.7/site-packages/ray/util/multiprocessing/pool.py", line 546, in _start_actor_pool
    ray.get([actor.ping.remote() for actor, _ in self._actor_pool])
  File "/home1/anaconda3/envs/ray/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home1/anaconda3/envs/ray/lib/python3.7/site-packages/ray/worker.py", line 1811, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.

Here’s a brief summary of my code:

import ray
from ray.util.multiprocessing.pool import Pool

def hydration_water_calculation2(t, u): # in one frame
  return xyz

run_per_frame = partial(hydration_water_calculation2, u = u0)
frame_values = np.arange(0, 2501)
with Pool(ray_address="auto") as worker_pool:
    result = worker_pool.map(run_per_frame, frame_values)

I also tried starting ray calling ray.init(address=os.environ[“ip_head”]) before creating Pool() , or calling ray.init(address="auto", _redis_password = os.environ["redis_password"]) and register_ray() before creating Pool() , but got the same RuntimeError: can't start new thread.

In this case, do you happen to know the tricks to set up Pool? Many thanks and I’ll keep trying to see if I can GET GcsServerAddress. :grinning:

You might want to have a look at this thread: python - error: can't start new thread - Stack Overflow

Seems like your problem could simply be an over subscription issue. Either from memory or platform limitation.

Yes! That’s of great help. For some reason the clusters I’m using don’t like users to take neither all the available cores of a node nor all the available threads of a core, which I didn’t realize when testing my codes.

As I specify smaller values for both --num-cpus in ray start and --ntasks-per-node when requesting nodes, it seems like Ray is able to initialize without issues, even if the notification from global_state_accessor is still there.

Thanks a lot for your help on this! I really appreciate it :grinning: