Non-deterministic connection of nodes to the head node?

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Hi all, I recently needed to restart my Ray cluster and noticed that it had stopped working. I didn't change any dependencies, machines, or networking between nodes. It started printing errors like:

> [2023-02-04 22:00:09,593 I 12894 12894] global_state_accessor.cc:357: This node has an IP address of <ip>, while we can not find the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.

This error is very suspicious, because connecting to the head node seems to be non-deterministic (!!).

If I run the command "ray start --address='$IP_HEAD:6379'", it may print the error above. If I then run "ray stop" and execute exactly the same command again, it may or may not work. Sometimes I need to repeat stop and start a few times before the node connects without this error.
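The stop-and-start workaround can be scripted. Below is a minimal Python sketch, assuming that a failed attempt can be recognized by the warning text from the log above and that `ray start` exits with code 0 on success. The names `start_with_retry`, `run_cli`, `max_attempts`, and `delay_s` are hypothetical helpers for illustration, not part of Ray. This only automates the workaround; it does not explain why the first attempts fail.

```python
import subprocess
import time

# Warning text copied from the log output quoted above.
RAYLET_WARNING = "can not find the matched Raylet address"


def run_cli(args):
    """Run a command and return (exit code, combined stdout and stderr)."""
    result = subprocess.run(
        args, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    return result.returncode, result.stdout


def start_with_retry(head_address, max_attempts=5, delay_s=2.0, run=run_cli):
    """Repeat `ray start` until it succeeds without the Raylet warning.

    Returns the number of the first clean attempt, or None if every
    attempt failed. After a failed attempt the node is stopped with
    `ray stop`, so on None the node is left stopped.
    """
    for attempt in range(1, max_attempts + 1):
        code, output = run(["ray", "start", f"--address={head_address}"])
        if code == 0 and RAYLET_WARNING not in output:
            return attempt
        # The attempt failed or logged the warning: stop and try again.
        run(["ray", "stop"])
        time.sleep(delay_s)
    return None
```

Calling `start_with_retry("<head IP>:6379")` on a worker machine would then report how many attempts were needed, which is also useful information to attach to a bug report.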

Here is an example session showing the commands and the problem:

(venv) $ ray start --address='$IP_HEAD:6379'
Local node IP: <IP_NODE>
[2023-02-04 22:44:31,742 I 1266809 1266809] global_state_accessor.cc:357: This node has an IP address of <IP_NODE>, while we can not find the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.

--------------------
Ray runtime started.
--------------------

To terminate the Ray runtime, run
  ray stop
(venv) $ ray stop
Stopped all 3 Ray processes.
(venv) $ ray start --address='$IP_HEAD:6379'
Local node IP: <IP_NODE>
[2023-02-04 22:44:48,119 I 1267197 1267197] global_state_accessor.cc:357: This node has an IP address of <IP_NODE>, while we can not find the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.

--------------------
Ray runtime started.
--------------------

To terminate the Ray runtime, run
  ray stop
(venv) $ ray stop
Stopped all 3 Ray processes.
(venv) $ ray start --address='$IP_HEAD:6379'
Local node IP: <IP_NODE>

--------------------
Ray runtime started.
--------------------

To terminate the Ray runtime, run
  ray stop

As you can see, it worked when I ran it for the third time. But why didn't it run correctly on the first try?
I tested this on 4 different machines in the same network. The Ray version is 2.1.0, and I observed the same problem on 2.2.0 (although with 2.2.0 I tested only 1 node).
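To gather more detail when the warning appears, one option is to compare the node's own IP with the addresses the cluster has registered. The sketch below assumes, based only on the wording of the log message, that the warning means the node's IP is not among the registered Raylet addresses. `alive_raylet_addresses`, `is_registered`, and `check_this_node` are hypothetical helper names; `ray.init`, `ray.nodes`, and `ray.util.get_node_ip_address` are existing Ray calls. Connecting from a node in the failed state may itself not work, so this is a starting point rather than a definitive diagnostic.

```python
def alive_raylet_addresses(nodes):
    """Return the sorted NodeManagerAddress values of nodes marked alive."""
    return sorted({n["NodeManagerAddress"] for n in nodes if n.get("Alive")})


def is_registered(node_ip, nodes):
    """True if node_ip matches the address of an alive node."""
    return node_ip in alive_raylet_addresses(nodes)


def check_this_node():
    """Connect to the running cluster and report this node's registration."""
    # Imported lazily so the helpers above can be used without Ray installed.
    import ray

    ray.init(address="auto")
    node_ip = ray.util.get_node_ip_address()
    nodes = ray.nodes()
    return node_ip, alive_raylet_addresses(nodes), is_registered(node_ip, nodes)
```

Running `check_this_node()` on a worker right after `ray start` returns the local IP, the list of registered addresses, and whether the two match.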

Hi @pk123, thanks for reporting this and for sharing the details. It sounds like you have a way of reliably reproducing this on a single node, which is great news for debugging it. Do you mind posting this as an issue on the Ray GitHub?