ray.get_actor() gets stuck if the Ray runtime is down

Ray cluster version 1.9.2, Python 3.9.7

When connecting to an existing Ray runtime, if the Ray cluster is down, ray.get_actor() gets stuck for 10 minutes before actually returning an error.

>>> import ray
>>> ray.init('auto')
2022-03-29 19:28:35,796	INFO worker.py:842 -- Connecting to existing Ray cluster at address: 127.0.0.1:6379
{'node_ip_address': '127.0.0.1', 'raylet_ip_address': '127.0.0.1', 'redis_address': '127.0.0.1:6379', 'object_store_address': '/tmp/ray/session_2022-03-29_19-28-24_559941_87903/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2022-03-29_19-28-24_559941_87903/sockets/raylet', 'webui_url': '127.0.0.1:8265', 'session_dir': '/tmp/ray/session_2022-03-29_19-28-24_559941_87903', 'metrics_export_port': 56934, 'node_id': 'b77f6ee82315b7a30724208121b58fa3e88c38ca973dd8feb20d0aef'}
>>> ray.get_actor('act')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "lib/python3.9/site-packages/ray/worker.py", line 1925, in get_actor
    return worker.core_worker.get_named_actor_handle(name, namespace or "")
  File "python/ray/_raylet.pyx", line 1748, in ray._raylet.CoreWorker.get_named_actor_handle
  File "python/ray/_raylet.pyx", line 158, in ray._raylet.check_status
ValueError: Failed to look up actor with name 'act'. This could because 1. You are trying to look up a named actor you didn't create. 2. The named actor died. 3. You did not use a namespace matching the namespace of the actor.
>>> 2022-03-29 19:28:50,519	ERROR worker.py:1247 -- listen_error_messages_raylet: Connection closed by server.
2022-03-29 19:28:50,523	ERROR worker.py:478 -- print_logs: Connection closed by server.
2022-03-29 19:28:50,525	ERROR import_thread.py:89 -- ImportThread: Connection closed by server.
>>> ray.get_actor('act') <-- gets stuck

[2022-03-29 19:29:59,757 E 87875 2588905] actor_manager.cc:90: There was timeout in getting the actor handle, probably because the GCS server is dead or under high load .

[2022-03-29 21:04:26,650 E 87875 2590212] gcs_server_address_updater.cc:76: Failed to receive the GCS address for 600 times without success. The worker will exit ungracefully. It is because GCS has died. It could be because there was an issue that kills GCS, such as high memory usage triggering OOM killer to kill GCS. Cluster will be highly likely unavailable if you see this log. Please check the log from gcs_server.err.

This also seems to happen if we call ray.init("auto") when the Ray cluster is down. Note that ray.is_initialized() returns True in this scenario.
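For illustration, a minimal sketch of that scenario (the actor name "act" and stopping the cluster out-of-band with "ray stop" are just assumptions to make it concrete):

import ray

# The cluster was already stopped at this point (e.g. via `ray stop` in another
# shell), but "auto" still resolves and init appears to succeed.
ray.init("auto")

print(ray.is_initialized())  # True, even though the GCS is unreachable
ray.get_actor("act")         # blocks for ~10 minutes instead of failing fast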

@Cherif_Jazra thanks for reporting. I think there must be some 10-minute timeout associated with this.

Thanks @Chen_Shen, yes indeed, I’ve identified the timeouts:

  1. ray.get_actor() taking 1 minute is driven by gcs_server_request_timeout_seconds, which defaults to 60s (ray/ray_config_def.h at ef593fe5d3c864836b80ae77be32635cef42b537 · ray-project/ray · GitHub).
  2. ray.shutdown() taking 10 minutes to recover is driven by a separate timeout: before giving up, the worker calls UpdateGcsServerAddress up to 600 times, once per second. This is controlled by ping_gcs_rpc_server_max_retries (600, ray/ray_config_def.h at ef593fe5d3c864836b80ae77be32635cef42b537 · ray-project/ray · GitHub) and the interval timer gcs_service_address_check_interval_milliseconds (ray/ray_config_def.h at ef593fe5d3c864836b80ae77be32635cef42b537 · ray-project/ray · GitHub). 600 retries at 1s each is roughly the 10 minutes observed.

These configuration values are read from environment variables when RayConfig is initialized (ray/ray_config.h at ef593fe5d3c864836b80ae77be32635cef42b537 · ray-project/ray · GitHub).
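Based on that, here is a minimal sketch of what I plan to try. The RAY_ prefix on the variable names is an assumption based on how ray_config.h reads the environment, and the values below are only examples:

import os

# Assumed convention: RayConfig reads "RAY_" + <config name> from the environment.
# Set these before ray.init() so the driver-side core worker picks them up; they
# would presumably also need to be exported wherever `ray start` runs to affect
# the server-side processes.
os.environ["RAY_gcs_server_request_timeout_seconds"] = "5"                  # default 60
os.environ["RAY_ping_gcs_rpc_server_max_retries"] = "30"                    # default 600
os.environ["RAY_gcs_service_address_check_interval_milliseconds"] = "1000"

import ray

ray.init("auto")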

Does anyone have experience reducing these timeout values?