How severely does this issue affect your experience of using Ray?
- Medium: It makes completing my task significantly harder, but I can work around it.
Hi all, I recently needed to restart my Ray cluster and noticed that it stopped working, even though nothing had changed: same dependencies, same machines, same networking between nodes. It started printing errors like:
> [2023-02-04 22:00:09,593 I 12894 12894] global_state_accessor.cc:357: This node has an IP address of <ip>, while we can not find the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.
This error is very suspicious, because connecting to the head node seems to be non-deterministic (!!)
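One way to narrow this down would be to check whether the IP that Ray detects locally appears among the raylet addresses the cluster reports. Below is a minimal sketch in Python. It assumes that `ray.nodes()` and `ray.util.get_node_ip_address()` reflect the same information the warning compares, which I have not confirmed against the C++ source. `find_matching_raylet` and `check_live_cluster` are hypothetical helper names, not part of Ray.

```python
from typing import Any, Dict, List, Optional


def find_matching_raylet(
    local_ip: str, nodes: List[Dict[str, Any]]
) -> Optional[Dict[str, Any]]:
    """Return the first alive node entry whose address equals local_ip, else None.

    `nodes` is expected to look like the list returned by ray.nodes(), where
    each entry has "Alive" and "NodeManagerAddress" keys.
    """
    for node in nodes:
        if node.get("Alive") and node.get("NodeManagerAddress") == local_ip:
            return node
    return None


def check_live_cluster() -> None:
    """Hypothetical helper: run on the worker node after `ray start`."""
    import ray  # imported lazily so the pure function above works without Ray

    ray.init(address="auto")  # attach to the already running local node
    local_ip = ray.util.get_node_ip_address()
    nodes = ray.nodes()
    match = find_matching_raylet(local_ip, nodes)
    if match is None:
        # Assumption: this is roughly the situation the warning describes.
        alive = sorted(n["NodeManagerAddress"] for n in nodes if n.get("Alive"))
        print(f"No alive raylet registered for {local_ip}; cluster reports {alive}")
    else:
        print(f"Raylet found for {local_ip}: node {match.get('NodeID')}")
```

Calling `check_live_cluster()` right after a start that printed the warning, and again after a clean start, would show whether the node is missing from the cluster's view or only registered late.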
If I run the command "ray start --address=':6379'", it may print the error above. If I then run "ray stop" and execute exactly the same command again, it may or may not work. Sometimes I need to repeat stop && start a few times before it connects without this error.
Here is an example of the commands I ran and the problem:
(venv) $ ray start --address='$IP_HEAD:6379'
Local node IP: <IP_NODE>
[2023-02-04 22:44:31,742 I 1266809 1266809] global_state_accessor.cc:357: This node has an IP address of <IP_NODE>, while we can not find the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.
--------------------
Ray runtime started.
--------------------
To terminate the Ray runtime, run
ray stop
(venv) $ ray stop
Stopped all 3 Ray processes.
(venv) $ ray start --address='$IP_HEAD:6379'
Local node IP: <IP_NODE>
[2023-02-04 22:44:48,119 I 1267197 1267197] global_state_accessor.cc:357: This node has an IP address of <IP_NODE>, while we can not find the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.
--------------------
Ray runtime started.
--------------------
To terminate the Ray runtime, run
ray stop
(venv) $ ray stop
Stopped all 3 Ray processes.
(venv) $ ray start --address='$IP_HEAD:6379'
Local node IP: <IP_NODE>
--------------------
Ray runtime started.
--------------------
To terminate the Ray runtime, run
ray stop
As you can see, it worked on the third attempt. But why didn't it run correctly on the first try?
I tested this on 4 different machines in the same network. The Ray version is 2.1.0, but I observed the same problem on 2.2.0 (although I only tested 1 node with 2.2.0).
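For now, the stop && start loop I do by hand could be automated roughly as follows. This is only a workaround sketch, not a fix. It assumes the warning text shows up in the captured output of `ray start` (stdout or stderr), and it only looks for that text without checking the exit code. `start_until_clean`, `run_command`, and the default values are names and numbers I made up for illustration.

```python
import subprocess
import time
from typing import Callable, Sequence

# Substring of the message printed by global_state_accessor.cc in the log above.
WARNING_MARKER = "can not find the matched Raylet address"


def run_command(cmd: Sequence[str]) -> str:
    """Run a command and return its stdout and stderr combined as text."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout + result.stderr


def start_until_clean(
    head_address: str,
    max_attempts: int = 5,
    delay_s: float = 2.0,
    run: Callable[[Sequence[str]], str] = run_command,
) -> int:
    """Retry `ray start` until its output no longer contains the warning.

    Returns the 1-based number of the attempt that started cleanly.
    Raises RuntimeError if every attempt printed the warning.
    """
    for attempt in range(1, max_attempts + 1):
        output = run(["ray", "start", f"--address={head_address}"])
        if WARNING_MARKER not in output:
            return attempt
        # The node started but printed the warning: stop it and try again.
        run(["ray", "stop"])
        time.sleep(delay_s)
    raise RuntimeError(f"warning still present after {max_attempts} attempts")
```

It would be called as `start_until_clean("<head IP>:6379")` on each worker node. The `run` parameter exists only so the loop can be exercised without a real cluster.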