How severely does this issue affect your experience of using Ray?
- Medium: It contributes significant difficulty to completing my task, but I can work around it.
Hi,
After @Alex and @Jules_Damji helped me with the problem in How to disable Autoscaler for local cluster - Ray Clusters - Ray, the Ray cluster works well now, and the autoscaler restarts Ray on nodes with no recent heartbeat:
2023-03-21 19:00:40,265 WARNING autoscaler.py:1213 -- StandardAutoscaler: 172.22.157.113: No recent heartbeat, restarting Ray to recover...
2023-03-21 19:00:40,265 INFO node_provider.py:116 -- ClusterState: Writing cluster state: ['172.22.157.115', '172.22.157.114', '172.22.157.113', '172.22.157.116']
2023-03-21 19:00:40,274 INFO node_provider.py:116 -- ClusterState: Writing cluster state: ['172.22.157.115', '172.22.157.114', '172.22.157.113', '172.22.157.116']
2023-03-21 19:00:40,278 INFO autoscaler.py:463 -- The autoscaler took 0.016 seconds to complete the update iteration.
2023-03-21 19:00:40,279 INFO monitor.py:430 -- :event_summary:Restarting 2 nodes of type local.cluster.node (lost contact with raylet).
2023-03-21 19:00:41,326 INFO node_provider.py:116 -- ClusterState: Writing cluster state: ['172.22.157.115', '172.22.157.114', '172.22.157.113', '172.22.157.116']
2023-03-21 19:00:42,386 INFO node_provider.py:116 -- ClusterState: Writing cluster state: ['172.22.157.115', '172.22.157.114', '172.22.157.113', '172.22.157.116']
2023-03-21 19:00:45,286 INFO autoscaler.py:144 -- The autoscaler took 0.0 seconds to fetch the list of non-terminated nodes.
2023-03-21 19:00:45,287 INFO autoscaler.py:419 --
======== Autoscaler status: 2023-03-21 19:00:45.287010 ========
Node status
---------------------------------------------------------------
Healthy:
4 local.cluster.node
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/128.0 CPU
0.00/1365.246 GiB memory
0.11/589.097 GiB object_store_memory
Demands:
(no resource demands)
2023-03-21 19:00:45,289
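For context, this is a local (on-prem) cluster with the head node at 172.22.157.115 and workers at 172.22.157.113, 172.22.157.114, and 172.22.157.116, launched from a cluster config file. Roughly, I bring it up and check it like this (the config file name below is just a placeholder for my actual file):

ray up -y cluster.yaml    # starts the head at 172.22.157.115 and the three workers (local.cluster.node)
ray status                # after the autoscaler restart it reports 4 healthy nodes, as in the log above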
But after the restart, with all nodes shown as alive in ray status, I run
ray.init(address='172.22.157.115:6379')
on the head node 172.22.157.115 and get the failure below:
2023-03-23 08:53:50,618 INFO worker.py:1364 -- Connecting to existing Ray cluster at address: 172.22.157.115:6379...
[2023-03-23 08:53:50,624 W 21792 21792] global_state_accessor.cc:389: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
[2023-03-23 08:53:51,625 W 21792 21792] global_state_accessor.cc:389: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
[2023-03-23 08:53:52,625 W 21792 21792] global_state_accessor.cc:389: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
[2023-03-23 08:53:53,625 W 21792 21792] global_state_accessor.cc:389: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
[2023-03-23 08:53:54,625 W 21792 21792] global_state_accessor.cc:389: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
[2023-03-23 08:53:55,626 W 21792 21792] global_state_accessor.cc:389: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
[2023-03-23 08:53:56,626 W 21792 21792] global_state_accessor.cc:389: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
[2023-03-23 08:53:57,627 W 21792 21792] global_state_accessor.cc:389: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
[2023-03-23 08:53:58,627 W 21792 21792] global_state_accessor.cc:389: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
[2023-03-23 08:53:59,627 W 21792 21792] global_state_accessor.cc:389: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/anaconda3/envs/option/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/opt/anaconda3/envs/option/lib/python3.7/site-packages/ray/_private/worker.py", line 1514, in init
connect_only=True,
File "/opt/anaconda3/envs/option/lib/python3.7/site-packages/ray/_private/node.py", line 246, in __init__
self._raylet_ip_address,
File "/opt/anaconda3/envs/option/lib/python3.7/site-packages/ray/_private/services.py", line 444, in get_node_to_connect_for_driver
return global_state.get_node_to_connect_for_driver(node_ip_address)
File "/opt/anaconda3/envs/option/lib/python3.7/site-packages/ray/_private/state.py", line 753, in get_node_to_connect_for_driver
node_ip_address
File "python/ray/includes/global_state_accessor.pxi", line 156, in ray._raylet.GlobalStateAccessor.get_node_to_connect_for_driver
RuntimeError: b"This node has an IP address of 172.22.157.115, and Ray expects this IP address to be either the GCS address or one of the Raylet addresses. Connected to GCS at 172.22.157.115 and found raylets at 172.22.157.114, 172.22.157.116, 172.22.157.113 but none of these match this node's IP 172.22.157.115. Are any of these actually a different IP address for the same node?You might need to provide --node-ip-address to specify the IP address that the head should use when sending to this node."
I don't know why this happens in the first place, since this node's IP already is the GCS address. Is there any way to restart the GCS besides using ray down and ray up to restart the whole cluster?
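Right now the only recovery I know of is a full restart with the cluster config (again, the file name is just a placeholder):

ray down -y cluster.yaml
ray up -y cluster.yaml

which is what I would like to avoid.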
Thanks!