Failed to connect after autoscaler restart

How severely does this issue affect your experience of using Ray?

  • Medium: It causes significant difficulty in completing my task, but I can work around it.

Hi,
After @Alex and @Jules_Damji helped me with the thread "How to disable Autoscaler for local cluster - Ray Clusters - Ray", the Ray cluster works well now, and the autoscaler restarts Ray on a node when it stops receiving heartbeats:

2023-03-21 19:00:40,265	WARNING autoscaler.py:1213 -- StandardAutoscaler: 172.22.157.113: No recent heartbeat, restarting Ray to recover...
2023-03-21 19:00:40,265	INFO node_provider.py:116 -- ClusterState: Writing cluster state: ['172.22.157.115', '172.22.157.114', '172.22.157.113', '172.22.157.116']
2023-03-21 19:00:40,274	INFO node_provider.py:116 -- ClusterState: Writing cluster state: ['172.22.157.115', '172.22.157.114', '172.22.157.113', '172.22.157.116']
2023-03-21 19:00:40,278	INFO autoscaler.py:463 -- The autoscaler took 0.016 seconds to complete the update iteration.
2023-03-21 19:00:40,279	INFO monitor.py:430 -- :event_summary:Restarting 2 nodes of type local.cluster.node (lost contact with raylet).
2023-03-21 19:00:41,326	INFO node_provider.py:116 -- ClusterState: Writing cluster state: ['172.22.157.115', '172.22.157.114', '172.22.157.113', '172.22.157.116']
2023-03-21 19:00:42,386	INFO node_provider.py:116 -- ClusterState: Writing cluster state: ['172.22.157.115', '172.22.157.114', '172.22.157.113', '172.22.157.116']
2023-03-21 19:00:45,286	INFO autoscaler.py:144 -- The autoscaler took 0.0 seconds to fetch the list of non-terminated nodes.
2023-03-21 19:00:45,287	INFO autoscaler.py:419 -- 
======== Autoscaler status: 2023-03-21 19:00:45.287010 ========
Node status
---------------------------------------------------------------
Healthy:
 4 local.cluster.node
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/128.0 CPU
 0.00/1365.246 GiB memory
 0.11/589.097 GiB object_store_memory

Demands:
 (no resource demands)
2023-03-21 19:00:45,289

But, after the restart and with all nodes alive in ray status, I run

ray.init(address='172.22.157.115:6379')

on the head node 172.22.157.115 and got the failure below:

2023-03-23 08:53:50,618	INFO worker.py:1364 -- Connecting to existing Ray cluster at address: 172.22.157.115:6379...
[2023-03-23 08:53:50,624 W 21792 21792] global_state_accessor.cc:389: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
[2023-03-23 08:53:51,625 W 21792 21792] global_state_accessor.cc:389: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
[2023-03-23 08:53:52,625 W 21792 21792] global_state_accessor.cc:389: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
[2023-03-23 08:53:53,625 W 21792 21792] global_state_accessor.cc:389: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
[2023-03-23 08:53:54,625 W 21792 21792] global_state_accessor.cc:389: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
[2023-03-23 08:53:55,626 W 21792 21792] global_state_accessor.cc:389: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
[2023-03-23 08:53:56,626 W 21792 21792] global_state_accessor.cc:389: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
[2023-03-23 08:53:57,627 W 21792 21792] global_state_accessor.cc:389: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
[2023-03-23 08:53:58,627 W 21792 21792] global_state_accessor.cc:389: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
[2023-03-23 08:53:59,627 W 21792 21792] global_state_accessor.cc:389: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/anaconda3/envs/option/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/opt/anaconda3/envs/option/lib/python3.7/site-packages/ray/_private/worker.py", line 1514, in init
    connect_only=True,
  File "/opt/anaconda3/envs/option/lib/python3.7/site-packages/ray/_private/node.py", line 246, in __init__
    self._raylet_ip_address,
  File "/opt/anaconda3/envs/option/lib/python3.7/site-packages/ray/_private/services.py", line 444, in get_node_to_connect_for_driver
    return global_state.get_node_to_connect_for_driver(node_ip_address)
  File "/opt/anaconda3/envs/option/lib/python3.7/site-packages/ray/_private/state.py", line 753, in get_node_to_connect_for_driver
    node_ip_address
  File "python/ray/includes/global_state_accessor.pxi", line 156, in ray._raylet.GlobalStateAccessor.get_node_to_connect_for_driver
RuntimeError: b"This node has an IP address of 172.22.157.115, and Ray expects this IP address to be either the GCS address or one of the Raylet addresses. Connected to GCS at 172.22.157.115 and found raylets at 172.22.157.114, 172.22.157.116, 172.22.157.113 but none of these match this node's IP 172.22.157.115. Are any of these actually a different IP address for the same node?You might need to provide --node-ip-address to specify the IP address that the head should use when sending to this node."

I don't know why this happened, because that IP is already the GCS address. Is there any way to restart the GCS besides using ray down and ray up to restart Ray?

Thanks!

Hmm, can you try ray.init(address="auto")? Also, can you confirm the output of ray up from when you started the cluster?
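
Roughly what I have in mind is the sketch below (untested; _node_ip_address is an underscore-prefixed, non-public ray.init argument, so treat the fallback as a best-effort guess based on the --node-ip-address hint in your error message):

import ray

try:
    # Let Ray resolve the address of the running cluster on its own.
    ray.init(address="auto")
except RuntimeError:
    # Best-effort fallback: pass the head IP explicitly. _node_ip_address is
    # a non-public (underscore-prefixed) ray.init argument that plays the same
    # role as the --node-ip-address hint in the error message.
    ray.init(
        address="172.22.157.115:6379",
        _node_ip_address="172.22.157.115",
    )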

Hi Alex, for ray.init(address="auto"), I got exactly the same log as ray.init(address='172.22.157.115:6379').

And here is the output for ray up command:

Cluster: local

Checking Local environment settings
2023-04-02 19:29:43,226	INFO node_provider.py:54 -- ClusterState: Loaded cluster state: ['172.22.157.115', '172.22.157.114', '172.22.157.113', '172.22.157.116']
No head node found. Launching a new cluster. Confirm [y/N]: y

Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.

Acquiring an up-to-date head node
2023-04-02 19:29:46,676	INFO node_provider.py:116 -- ClusterState: Writing cluster state: ['172.22.157.115', '172.22.157.114', '172.22.157.113', '172.22.157.116']
  Launched a new head node
  Fetching the new head node
  
<1/1> Setting up head node
  Prepared bootstrap config
2023-04-02 19:29:46,677	INFO node_provider.py:116 -- ClusterState: Writing cluster state: ['172.22.157.115', '172.22.157.114', '172.22.157.113', '172.22.157.116']
  New status: waiting-for-ssh
  [1/7] Waiting for SSH to become available
    Running `uptime` as a test.
    Fetched IP: 172.22.157.115
Warning: Permanently added '172.22.157.115' (ECDSA) to the list of known hosts.
root@172.22.157.115's password: 
 19:29:53 up 17 days, 10:24,  2 users,  load average: 0.06, 0.04, 0.05
Shared connection to 172.22.157.115 closed.
    Success.
  Updating cluster configuration. [hash=28012af5186108bed9d09d224f9163866885c55a]
2023-04-02 19:29:53,267	INFO node_provider.py:116 -- ClusterState: Writing cluster state: ['172.22.157.115', '172.22.157.114', '172.22.157.113', '172.22.157.116']
  New status: syncing-files
  [2/7] Processing file mounts
Shared connection to 172.22.157.115 closed.
  [3/7] No worker file mounts to sync
2023-04-02 19:29:54,083	INFO node_provider.py:116 -- ClusterState: Writing cluster state: ['172.22.157.115', '172.22.157.114', '172.22.157.113', '172.22.157.116']
  New status: setting-up
  [4/7] No initialization commands to run.
  [5/7] Initializing command runner
  [6/7] Running setup commands
    (0/1) source activate && conda activ...
Shared connection to 172.22.157.115 closed.
  [7/7] Starting the Ray runtime
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.

Local node IP: 172.22.157.115

--------------------
Ray runtime started.
--------------------

Next steps
  To connect to this Ray runtime from another node, run
    ray start --address='172.22.157.115:6379'
  
  Alternatively, use the following Python code:
    import ray
    ray.init(address='auto')
  
  To see the status of the cluster, use
    ray status
  To monitor and debug Ray, view the dashboard at 
    172.22.157.115:8265
  
  If connection fails, check your firewall settings and network configuration.
  
  To terminate the Ray runtime, run
    ray stop
Shared connection to 172.22.157.115 closed.
2023-04-02 19:29:57,916	INFO node_provider.py:116 -- ClusterState: Writing cluster state: ['172.22.157.115', '172.22.157.114', '172.22.157.113', '172.22.157.116']
  New status: up-to-date

Useful commands
  Monitor autoscaling with
    ray exec ray_cluster_config.yaml 'tail -n 100 -f /tmp/ray/session_latest/logs/monitor*'
  Connect to a terminal on the cluster head:
    ray attach ray_cluster_config.yaml
  Get a remote shell to the cluster manually:
    ssh -o IdentitiesOnly=yes root@172.22.157.115

Hi Alex, do you have any idea about my problem? I have to restart Ray every morning because of it :sob:. Any advice would be helpful, thanks!

I’m getting this error after upgrading from 2.2 to 2.4.0. It requires me to restart the application before things work again. I see no reason for the error to appear, as it was working perfectly well before.

For me it usually appears as follows:

1. Initializing Ray on several SLURM nodes → works okay. Closing the app on the nodes through SLURM.
2. Sending another job to SLURM using Ray Core 2.4 → this error → cancelling the job and resending → works once again.
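
As a stopgap I've been considering retrying the connection from inside the driver instead of cancelling and resubmitting the whole SLURM job. A rough sketch (untested; it assumes the error clears on its own once the raylets have re-registered with the GCS, and the retry count and delay are arbitrary):

import time

import ray


def init_with_retry(address="auto", attempts=3, delay=10.0):
    """Retry ray.init(), since the raylets sometimes seem to need a moment
    to re-register with the GCS after a restart (my assumption)."""
    for attempt in range(1, attempts + 1):
        try:
            return ray.init(address=address)
        except RuntimeError as exc:
            if attempt == attempts:
                raise
            print(f"ray.init failed ({exc}); retrying in {delay}s "
                  f"(attempt {attempt}/{attempts})")
            time.sleep(delay)


init_with_retry()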