Ray cluster version 1.9.2, Python 3.9.7
When connected to an existing Ray runtime, if the Ray cluster goes down, a call to ray.get_actor
gets stuck for 10 minutes before actually returning an error.
>>> import ray
>>> ray.init('auto')
2022-03-29 19:28:35,796 INFO worker.py:842 -- Connecting to existing Ray cluster at address: 127.0.0.1:6379
{'node_ip_address': '127.0.0.1', 'raylet_ip_address': '127.0.0.1', 'redis_address': '127.0.0.1:6379', 'object_store_address': '/tmp/ray/session_2022-03-29_19-28-24_559941_87903/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2022-03-29_19-28-24_559941_87903/sockets/raylet', 'webui_url': '127.0.0.1:8265', 'session_dir': '/tmp/ray/session_2022-03-29_19-28-24_559941_87903', 'metrics_export_port': 56934, 'node_id': 'b77f6ee82315b7a30724208121b58fa3e88c38ca973dd8feb20d0aef'}
>>> ray.get_actor('act')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "lib/python3.9/site-packages/ray/worker.py", line 1925, in get_actor
return worker.core_worker.get_named_actor_handle(name, namespace or "")
File "python/ray/_raylet.pyx", line 1748, in ray._raylet.CoreWorker.get_named_actor_handle
File "python/ray/_raylet.pyx", line 158, in ray._raylet.check_status
ValueError: Failed to look up actor with name 'act'. This could because 1. You are trying to look up a named actor you didn't create. 2. The named actor died. 3. You did not use a namespace matching the namespace of the actor.
>>> 2022-03-29 19:28:50,519 ERROR worker.py:1247 -- listen_error_messages_raylet: Connection closed by server.
2022-03-29 19:28:50,523 ERROR worker.py:478 -- print_logs: Connection closed by server.
2022-03-29 19:28:50,525 ERROR import_thread.py:89 -- ImportThread: Connection closed by server.
>>> ray.get_actor('act') <-- gets stuck
[2022-03-29 19:29:59,757 E 87875 2588905] actor_manager.cc:90: There was timeout in getting the actor handle, probably because the GCS server is dead or under high load .
[2022-03-29 21:04:26,650 E 87875 2590212] gcs_server_address_updater.cc:76: Failed to receive the GCS address for 600 times without success. The worker will exit ungracefully. It is because GCS has died. It could be because there was an issue that kills GCS, such as high memory usage triggering OOM killer to kill GCS. Cluster will be highly likely unavailable if you see this log. Please check the log from gcs_server.err.
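As a client-side workaround until the lookup itself fails fast, the blocking call can be pushed into a worker thread and abandoned after a deadline. This is only a sketch, not part of Ray's API: `call_with_timeout` is a hypothetical helper, and the hung thread is abandoned rather than killed (it still blocks inside the GCS RPC until Ray's own 600-retry limit kicks in).

```python
import concurrent.futures
import time


def call_with_timeout(fn, *args, timeout_s=15.0, **kwargs):
    """Run a potentially hanging call in a worker thread and give up after
    timeout_s seconds. In practice you would wrap ray.get_actor('act') with
    this so a dead GCS surfaces as a TimeoutError in seconds instead of
    blocking for ~10 minutes. The hung thread is abandoned, not killed."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    finally:
        # wait=False so a timed-out call does not block shutdown here.
        pool.shutdown(wait=False)


if __name__ == "__main__":
    # Stand-in for a hanging ray.get_actor call.
    def hangs():
        time.sleep(2)
        return "actor handle"

    try:
        call_with_timeout(hangs, timeout_s=0.5)
    except concurrent.futures.TimeoutError:
        print("lookup timed out, cluster probably down")
```

Usage would be `handle = call_with_timeout(ray.get_actor, 'act', timeout_s=15.0)`, catching `concurrent.futures.TimeoutError` instead of waiting out the GCS retry loop.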
This also seems to happen if we call ray.init("auto")
when the ray cluster is down. Note that ray.is_initialized()
returns True in this scenario, even though the cluster is unreachable.
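Since ray.is_initialized() only reflects the local driver's connection state, an external liveness probe is one way to tell whether the cluster is actually up. A minimal sketch, assuming the `ray status` CLI command exits non-zero (or hangs) when the GCS is dead; the command is parameterized here only so the helper can be exercised without a Ray installation:

```python
import subprocess


def cluster_reachable(cmd=("ray", "status"), timeout_s=10.0):
    """Best-effort liveness probe: run `ray status` under a hard timeout.

    Returns True only if the command exits 0 within the deadline. A hang
    (GCS dead but raylet sockets lingering) or a missing binary both
    count as unreachable."""
    try:
        result = subprocess.run(
            list(cmd), capture_output=True, timeout=timeout_s
        )
        return result.returncode == 0
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
```

Gating ray.get_actor calls on `cluster_reachable()` instead of `ray.is_initialized()` avoids entering the 10-minute hang in the first place, at the cost of an extra subprocess per check.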