ray.get_actor() gets stuck if the Ray runtime is down

Ray cluster version 1.9.2, Python 3.9.7

When connecting to an existing Ray runtime, if the Ray cluster is down, ray.get_actor() gets stuck for 10 minutes before actually returning an error.

>>> import ray
>>> ray.init('auto')
2022-03-29 19:28:35,796	INFO worker.py:842 -- Connecting to existing Ray cluster at address: 127.0.0.1:6379
{'node_ip_address': '127.0.0.1', 'raylet_ip_address': '127.0.0.1', 'redis_address': '127.0.0.1:6379', 'object_store_address': '/tmp/ray/session_2022-03-29_19-28-24_559941_87903/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2022-03-29_19-28-24_559941_87903/sockets/raylet', 'webui_url': '127.0.0.1:8265', 'session_dir': '/tmp/ray/session_2022-03-29_19-28-24_559941_87903', 'metrics_export_port': 56934, 'node_id': 'b77f6ee82315b7a30724208121b58fa3e88c38ca973dd8feb20d0aef'}
>>> ray.get_actor('act')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "lib/python3.9/site-packages/ray/worker.py", line 1925, in get_actor
    return worker.core_worker.get_named_actor_handle(name, namespace or "")
  File "python/ray/_raylet.pyx", line 1748, in ray._raylet.CoreWorker.get_named_actor_handle
  File "python/ray/_raylet.pyx", line 158, in ray._raylet.check_status
ValueError: Failed to look up actor with name 'act'. This could because 1. You are trying to look up a named actor you didn't create. 2. The named actor died. 3. You did not use a namespace matching the namespace of the actor.
>>> 2022-03-29 19:28:50,519	ERROR worker.py:1247 -- listen_error_messages_raylet: Connection closed by server.
2022-03-29 19:28:50,523	ERROR worker.py:478 -- print_logs: Connection closed by server.
2022-03-29 19:28:50,525	ERROR import_thread.py:89 -- ImportThread: Connection closed by server.
>>> ray.get_actor('act') <-- gets stuck

[2022-03-29 19:29:59,757 E 87875 2588905] actor_manager.cc:90: There was timeout in getting the actor handle, probably because the GCS server is dead or under high load .

[2022-03-29 21:04:26,650 E 87875 2590212] gcs_server_address_updater.cc:76: Failed to receive the GCS address for 600 times without success. The worker will exit ungracefully. It is because GCS has died. It could be because there was an issue that kills GCS, such as high memory usage triggering OOM killer to kill GCS. Cluster will be highly likely unavailable if you see this log. Please check the log from gcs_server.err.

This also seems to happen if we call ray.init("auto") when the Ray cluster is down. Note that ray.is_initialized() returns True in this scenario.
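For illustration, a minimal sketch of that scenario (the actor name "act" and stopping the cluster out-of-band with "ray stop" are just assumptions to make it concrete):

import ray

# The cluster was already stopped at this point (e.g. via `ray stop` in another
# shell), but "auto" still resolves and init appears to succeed.
ray.init("auto")

print(ray.is_initialized())  # True, even though the GCS is unreachable
ray.get_actor("act")         # blocks for ~10 minutes instead of failing fast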

@Cherif_Jazra thanks for reporting. I think there must be some 10-minute timeout associated with this.

Thanks @Chen_Shen, yes indeed, I’ve identified the timeouts:

  1. ray.get_actor() taking 1 minute is driven by gcs_server_request_timeout_seconds, which defaults to 60s (ray/ray_config_def.h at ef593fe5d3c864836b80ae77be32635cef42b537 · ray-project/ray · GitHub).
  2. ray.shutdown() taking 10 minutes to recover is driven by a separate timeout: before giving up, the worker calls UpdateGcsServerAddress up to 600 times, once per second. This is controlled by ping_gcs_rpc_server_max_retries (600, ray/ray_config_def.h at ef593fe5d3c864836b80ae77be32635cef42b537 · ray-project/ray · GitHub) and the interval timer gcs_service_address_check_interval_milliseconds (ray/ray_config_def.h at ef593fe5d3c864836b80ae77be32635cef42b537 · ray-project/ray · GitHub). 600 retries at 1s each is roughly the 10 minutes observed.

These configuration values are read from environment variables when RayConfig is initialized (ray/ray_config.h at ef593fe5d3c864836b80ae77be32635cef42b537 · ray-project/ray · GitHub).
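Based on that, here is a minimal sketch of what I plan to try. The RAY_ prefix on the variable names is an assumption based on how ray_config.h reads the environment, and the values below are only examples:

import os

# Assumed convention: RayConfig reads "RAY_" + <config name> from the environment.
# Set these before ray.init() so the driver-side core worker picks them up; they
# would presumably also need to be exported wherever `ray start` runs to affect
# the server-side processes.
os.environ["RAY_gcs_server_request_timeout_seconds"] = "5"                  # default 60
os.environ["RAY_ping_gcs_rpc_server_max_retries"] = "30"                    # default 600
os.environ["RAY_gcs_service_address_check_interval_milliseconds"] = "1000"

import ray

ray.init("auto")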

Does anyone have experience reducing these timeout values?