Periodic _MultiThreadedRendezvous failure leaves cluster in damaged state

[Posting a description of a possible bug per @sangcho's request from another thread]

We are running a large job on our Ray cluster that works a few times and then fails with the error message at the bottom of this post. Once that happens, all subsequent runs produce the same error until the cluster is brought down and then back up again.

The error message is contradictory in that it first claims 10.0.1.34 is neither the Redis address nor one of the Raylet addresses:

This node has an IP address of 10.0.1.34, and Ray expects this IP address to be either the Redis address or one of the Raylet addresses.

But then the error message goes on to explicitly list that same IP in both those groups:

Connected to Redis at 10.0.1.34:6379 and found raylets at ... 10.0.1.34 ...
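
As a side note, a trivial diagnostic we could run on the affected node is to print the IP address Ray resolves locally; presumably this is what the check above compares against the Redis and raylet addresses (a sketch only):

# Diagnostic sketch: print the IP address Ray resolves for this node.
# Run on 10.0.1.34; we would expect it to print 10.0.1.34.
import ray

print(ray.util.get_node_ip_address())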

After this happens, the autoscaler does appear to have stuck resources hanging around, as the status output shows 47 CPUs in use after the failure even though no other jobs are running:

======== Autoscaler status: 2021-07-12 12:18:32.602594 ========
Node status
---------------------------------------------------------------
Healthy:
 43 local.cluster.node
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------

Usage:
 47.0/1952.0 CPU
 0.0/34.0 GPU
 0.0/3.0 accelerator_type:T
 0.00/2878.539 GiB memory
 3.11/1237.651 GiB object_store_memory

Demands:
 {'CPU': 1.0}: 1471+ pending tasks/actors
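
For what it's worth, when a client can still connect, the same numbers can be cross-checked from Python with something like this (the address is a placeholder for our head node; just a sanity probe, not part of the failing job):

# Sanity probe: compare total vs. available cluster resources via Ray Client.
# The address below is a placeholder; substitute the real head node hostname.
import ray

ray.init(address="ray://<head-node>:10001")
total = ray.cluster_resources()        # e.g. {'CPU': 1952.0, 'GPU': 34.0, ...}
free = ray.available_resources()       # what is currently schedulable
print("CPUs in use:", total.get("CPU", 0) - free.get("CPU", 0))
ray.shutdown()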

Does anyone have an idea what might be going on, or how to work around it?

Note that we are running a nightly snapshot from a few weeks ago, which includes some other bug fixes we needed. Here is the wheel URL/commit: https://s3-us-west-2.amazonaws.com/ray-wheels/master/f5f23448fcab7c896e478e9e5c9804a73688ec41/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl
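
For reference, a quick way to confirm which build a node is actually running; ray.__version__ and ray.__commit__ should match the wheel above:

# Check that the installed Ray build matches the pinned nightly wheel.
# The expected commit hash is taken from the wheel URL above.
import ray

EXPECTED_COMMIT = "f5f23448fcab7c896e478e9e5c9804a73688ec41"
print("version:", ray.__version__)   # 2.0.0.dev0 for this nightly
print("commit: ", ray.__commit__)
if ray.__commit__ != EXPECTED_COMMIT:
    print("WARNING: this node is not running the pinned wheel")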

I'm not sure how to advise on reproduction due to the complexity of the jobs; however, we can consistently reproduce the failure in our lab if anyone wants us to extract more debug information.

Here is the full error message:

Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
Queue filler thread failed to  join before timeout: 5
[2021-07-12 04:34:54,535 ERROR  ] Got Error from data channel -- shutting down:
Traceback (most recent call last):
  File "/ceph/var/users/focusvq/anaconda3/envs/ray2/lib/python3.7/site-packages/ray/util/client/dataclient.py", line 69, in _data_main
    for response in resp_stream:
  File "/ceph/var/users/focusvq/anaconda3/envs/ray2/lib/python3.7/site-packages/grpc/_channel.py", line 426, in __next__
    return self._next()
  File "/ceph/var/users/focusvq/anaconda3/envs/ray2/lib/python3.7/site-packages/grpc/_channel.py", line 826, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
  status = StatusCode.UNKNOWN
  details = "Exception iterating responses: <_MultiThreadedRendezvous of RPC that terminated with:
  status = StatusCode.UNKNOWN
  details = "Exception iterating responses: This node has an IP address of 10.0.1.34, and Ray expects this IP address to be either the Redis address or one of the Raylet addresses. Connected to Redis at 10.0.1.34:6379 and found raylets at 10.0.1.17, 10.0.1.103, 10.0.1.12, 10.0.1.33, 10.0.1.5, 10.0.1.2, 10.0.1.38, 10.0.1.15, 10.0.1.37, 10.0.1.11, 10.0.1.40, 10.0.1.24, 10.0.1.26, 10.0.1.3, 10.0.1.10, 10.0.1.6, 10.0.1.32, 10.0.1.36, 10.0.1.23, 10.0.1.21, 10.0.1.14, 10.0.1.29, 10.0.1.25, 10.0.1.104, 10.0.1.19, 10.0.1.35, 10.0.1.13, 10.0.1.101, 10.0.1.18, 10.0.1.20, 10.0.1.7, 10.0.1.1, 10.0.1.4, 10.0.1.30, 10.0.1.22, 10.0.1.16, 10.0.1.27, 10.0.1.9, 10.0.1.28, 10.0.1.8, 10.0.1.39, 10.0.1.31, 10.0.1.34 but none of these match this node's IP 10.0.1.34. Are any of these actually a different IP address for the same node?You might need to provide --node-ip-address to specify the IP address that the head should use when sending to this node."
  debug_error_string = "{"created":"@1626078894.533785785","description":"Error received from peer ipv6:[::1]:23008","file":"src/core/lib/surface/call.cc","file_line":1062,"grpc_message":"Exception iterating responses: This node has an IP address of 10.0.1.34, and Ray expects this IP address to be either the Redis address or one of the Raylet addresses. Connected to Redis at 10.0.1.34:6379 and found raylets at 10.0.1.17, 10.0.1.103, 10.0.1.12, 10.0.1.33, 10.0.1.5, 10.0.1.2, 10.0.1.38, 10.0.1.15, 10.0.1.37, 10.0.1.11, 10.0.1.40, 10.0.1.24, 10.0.1.26, 10.0.1.3, 10.0.1.10, 10.0.1.6, 10.0.1.32, 10.0.1.36, 10.0.1.23, 10.0.1.21, 10.0.1.14, 10.0.1.29, 10.0.1.25, 10.0.1.104, 10.0.1.19, 10.0.1.35, 10.0.1.13, 10.0.1.101, 10.0.1.18, 10.0.1.20, 10.0.1.7, 10.0.1.1, 10.0.1.4, 10.0.1.30, 10.0.1.22, 10.0.1.16, 10.0.1.27, 10.0.1.9, 10.0.1.28, 10.0.1.8, 10.0.1.39, 10.0.1.31, 10.0.1.34 but none of these match this node's IP 10.0.1.34. Are any of these actually a different IP address for the same node?You might need to provide --node-ip-address to specify the IP address that the head should use when sending to this node.","grpc_status":2}"
>"
  debug_error_string = "{"created":"@1626078894.535529655","description":"Error received from peer ipv4:10.0.1.34:10001","file":"src/core/lib/surface/call.cc","file_line":1066,"grpc_message":"Exception iterating responses: <_MultiThreadedRendezvous of RPC that terminated with:\n\tstatus = StatusCode.UNKNOWN\n\tdetails = "Exception iterating responses: This node has an IP address of 10.0.1.34, and Ray expects this IP address to be either the Redis address or one of the Raylet addresses. Connected to Redis at 10.0.1.34:6379 and found raylets at 10.0.1.17, 10.0.1.103, 10.0.1.12, 10.0.1.33, 10.0.1.5, 10.0.1.2, 10.0.1.38, 10.0.1.15, 10.0.1.37, 10.0.1.11, 10.0.1.40, 10.0.1.24, 10.0.1.26, 10.0.1.3, 10.0.1.10, 10.0.1.6, 10.0.1.32, 10.0.1.36, 10.0.1.23, 10.0.1.21, 10.0.1.14, 10.0.1.29, 10.0.1.25, 10.0.1.104, 10.0.1.19, 10.0.1.35, 10.0.1.13, 10.0.1.101, 10.0.1.18, 10.0.1.20, 10.0.1.7, 10.0.1.1, 10.0.1.4, 10.0.1.30, 10.0.1.22, 10.0.1.16, 10.0.1.27, 10.0.1.9, 10.0.1.28, 10.0.1.8, 10.0.1.39, 10.0.1.31, 10.0.1.34 but none of these match this node's IP 10.0.1.34. Are any of these actually a different IP address for the same node?You might need to provide --node-ip-address to specify the IP address that the head should use when sending to this node."\n\tdebug_error_string = "{"created":"@1626078894.533785785","description":"Error received from peer ipv6:[::1]:23008","file":"src/core/lib/surface/call.cc","file_line":1062,"grpc_message":"Exception iterating responses: This node has an IP address of 10.0.1.34, and Ray expects this IP address to be either the Redis address or one of the Raylet addresses. Connected to Redis at 10.0.1.34:6379 and found raylets at 10.0.1.17, 10.0.1.103, 10.0.1.12, 10.0.1.33, 10.0.1.5, 10.0.1.2, 10.0.1.38, 10.0.1.15, 10.0.1.37, 10.0.1.11, 10.0.1.40, 10.0.1.24, 10.0.1.26, 10.0.1.3, 10.0.1.10, 10.0.1.6, 10.0.1.32, 10.0.1.36, 10.0.1.23, 10.0.1.21, 10.0.1.14, 10.0.1.29, 10.0.1.25, 10.0.1.104, 10.0.1.19, 10.0.1.35, 10.0.1.13, 10.0.1.101, 10.0.1.18, 10.0.1.20, 10.0.1.7, 10.0.1.1, 10.0.1.4, 10.0.1.30, 10.0.1.22, 10.0.1.16, 10.0.1.27, 10.0.1.9, 10.0.1.28, 10.0.1.8, 10.0.1.39, 10.0.1.31, 10.0.1.34 but none of these match this node's IP 10.0.1.34. Are any of these actually a different IP address for the same node?You might need to provide --node-ip-address to specify the IP address that the head should use when sending to this node.","grpc_status":2}"\n>","grpc_status":2}"
>
Traceback (most recent call last):
  File "/ceph/var/elcano/archives/inputFiles/v311_live.5.build/mnesRanking.py", line 585, in <module>
    clientBuilder.connect()
  File "/ceph/var/users/focusvq/anaconda3/envs/ray2/lib/python3.7/site-packages/ray/client_builder.py", line 97, in connect
    self.address, job_config=self._job_config)
  File "/ceph/var/users/focusvq/anaconda3/envs/ray2/lib/python3.7/site-packages/ray/util/client_connect.py", line 36, in connect
    ignore_version=ignore_version)
  File "/ceph/var/users/focusvq/anaconda3/envs/ray2/lib/python3.7/site-packages/ray/util/client/__init__.py", line 83, in connect
    self.client_worker._server_init(job_config)
  File "/ceph/var/users/focusvq/anaconda3/envs/ray2/lib/python3.7/site-packages/ray/util/client/worker.py", line 525, in _server_init
    self._call_init(init_req)
  File "/ceph/var/users/focusvq/anaconda3/envs/ray2/lib/python3.7/site-packages/ray/util/client/worker.py", line 503, in _call_init
    init_resp = self.data_client.Init(init_request)
  File "/ceph/var/users/focusvq/anaconda3/envs/ray2/lib/python3.7/site-packages/ray/util/client/dataclient.py", line 151, in Init
    resp = self._blocking_send(datareq)
  File "/ceph/var/users/focusvq/anaconda3/envs/ray2/lib/python3.7/site-packages/ray/util/client/dataclient.py", line 126, in _blocking_send
    "Sending request failed because the data channel "
ConnectionError: Sending request failed because the data channel terminated. This is usually due to an error in handling the most recent request: req_id: 1
init {
  job_config: "\200\004\225\265\001\000\000\000\000\000\000\214\016ray.job_config\224\214\tJobConfig\224\223\224)\201\224}\224(\214\nworker_env\224}\224\214\021AUTOSCALER_EVENTS\224\214\0010\224s\214\034num_java_workers_per_process\224K\001\214\013jvm_options\224]\224\214\020code_search_path\224]\224\214\nclient_job\224\211\214\010metadata\224}\224\214\rray_namespace\224N\214\023_parsed_runtime_env\224\214\030ray._private.runtime_env\224\214\016RuntimeEnvDict\224\223\224)\201\224}\224\214\005_dict\224}\224(\214\013working_dir\224N\214\005conda\224\214\010dan-py37\224\214\003pip\224N\214\010env_vars\224}\224h\007h\010s\214\013_ray_commit\224\214(be6db0648546f51d16dba4ccc6a8ddcff6d10400\224usb\214\013runtime_env\224}\224(h\036h\037h\033h\034u\214\n_cached_pb\224Nub."
}

Hmm, it seems like the raylet at that IP address has a problem. Is it possible for you to get the raylet.out and raylet.err logs from /tmp/ray/session_latest/logs on 10.0.1.34?
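
Something along these lines, run directly on 10.0.1.34, should be enough to grab the relevant tails (the paths are just the Ray defaults; adjust if you use a custom temp dir):

# Print the last 50 lines of the raylet logs from the default session directory.
from pathlib import Path

LOG_DIR = Path("/tmp/ray/session_latest/logs")
for name in ("raylet.out", "raylet.err"):
    path = LOG_DIR / name
    if not path.exists():
        print(f"{name}: not found")
        continue
    lines = path.read_text(errors="replace").splitlines()
    print(f"===== {name} (last 50 lines) =====")
    print("\n".join(lines[-50:]))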

Hi @sangcho, sorry for the delay. We had to retool our job submissions to avoid the issue and I’m stuck behind a big queue of production jobs. I’ll need a few more days to hopefully reproduce. I just wanted to let you know I’m not ignoring your request, thanks!

@sangcho We've run into a similar error: the IP address in question refers to the Ray head node in a k8s cluster, which still seems to be running. However, all clients failed and it is impossible to reconnect to it. raylet.err is here: raylet.err - Pastebin.com, and raylet.out is here: raylet.out - Pastebin.com. Any thoughts on why this might be happening?

At the time I pulled the logs, this is what was running on the head node:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   64 root      20   0 1190080  29380    988 S   2.3   0.7  13:05.96 /venv/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server --redis_address=20.30.81.95 --redis_port=6379 --log_dir=/tmp/ray/session_2021-12-09_19-37-00_002981_50/logs --config_list=eyJvYmplY3R+
    1 root      20   0    5732    256      0 S   0.0   0.0   0:00.01 /bin/bash -c -- trap : TERM INT; touch /tmp/raylogs; tail -f /tmp/raylogs; sleep infinity & wait;
    8 root      20   0    4300    124      8 S   0.0   0.0   0:02.15 tail -f /tmp/raylogs
   54 root      20   0   83388   7852   2344 S   0.0   0.2   1:17.12 /venv/lib/python3.7/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *:6379
   59 root      20   0   68028   7948    996 S   0.0   0.2   0:18.79 /venv/lib/python3.7/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *:59586
   65 root      20   0 2534684  49800      0 S   0.0   1.3   1:16.58 /venv/bin/python -m ray.util.client.server --redis-address=20.30.81.95:6379 --port=10001 --mode=proxy --redis-password=5241590000000000 --metrics-agent-port=60186
   81 root      20   0  289312  42504   2232 S   0.0   1.1   2:10.28 /venv/bin/python -u /venv/lib/python3.7/site-packages/ray/_private/log_monitor.py --redis-address=20.30.81.95:6379 --logs-dir=/tmp/ray/session_2021-12-09_19-37-00_002981_50/logs --logging-rotate-byte+
  624 root      20   0    5996    648      0 S   0.0   0.0   0:00.00 /bin/bash
  625 root      20   0 1929956 127152   5008 S   0.0   3.2   5:33.76 /venv/bin/python -u /venv/lib/python3.7/site-packages/ray/dashboard/dashboard.py --host=0.0.0.0 --port=8265 --port-retries=0 --redis-address=20.30.81.95:6379 --temp-dir=/tmp/ray --log-dir=/tmp/ray/se+
 3437 root      20   0    5996   3888   3220 S   0.0   0.1   0:00.10 /bin/bash
 3657 root      20   0    7996   3616   3112 R   0.0   0.1   0:00.00 top

Reading the logs, it seems like your local raylet has died. You are supposed to have a raylet process running on your node. The error message itself is just a bad one (it doesn't handle raylet death properly).

[2021-12-09 22:49:08,685 I 80 80] worker_pool.cc:419: Started worker process of 1 worker(s) with pid 3445
[2021-12-09 22:49:08,877 I 80 80] service_based_accessor.cc:611: Received notification for node id = 0c148c687e1720cb475e2fc77c4532a3dfc6ce30e729a08f1c6fd58c, IsAlive = 0
[2021-12-09 22:49:08,877 I 80 80] service_based_accessor.cc:611: Received notification for node id = 6f0f4408f28568b739ef0b2ea5d3f754130aa169d5de1aa0886c1ec5, IsAlive = 0
[2021-12-09 22:49:09,273 C 80 80] node_manager.cc:794:  Check failed: node_id != self_node_id_ Exiting because this node manager has mistakenly been marked dead by the monitor: GCS didn't receive heartbeats within timeout 30000 ms. This is likely since the machine or raylet became overloaded.
*** StackTrace Information ***
    ray::SpdLogMessage::Flush()
    ray::RayLog::~RayLog()
    ray::raylet::NodeManager::NodeRemoved()
    ray::gcs::ServiceBasedNodeInfoAccessor::HandleNotification()
    std::_Function_handler<>::_M_invoke()
    ray::gcs::GcsPubSub::ExecuteCommandIfPossible()::{lambda()#1}::operator()()
    std::_Function_handler<>::_M_invoke()
    std::_Function_handler<>::_M_invoke()
    boost::asio::detail::completion_handler<>::do_complete()
    boost::asio::detail::scheduler::do_run_one()
    boost::asio::detail::scheduler::run()
    boost::asio::io_context::run()
    main
    __libc_start_main

The log indicates your raylet was overloaded and killed because it failed to communicate with the head node within the heartbeat timeout. Are you running a driver on a non-head node?

My understanding is that the driver is what calls ray.init() and starts tasks. If that's correct, our driver runs on a client machine outside the cluster, and it connects using the Ray Client (ray.init(address="ray://oururl.ourdomain.com:10001")).

These logs came from the head node itself, which was what surprised me.
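
For concreteness, the client side is essentially just the following, with all of the actual work happening in remote tasks and actors; the URL is a placeholder for ours:

# Minimal sketch of how our external driver connects over Ray Client.
# The URL is a placeholder; the real one points at the head node's port 10001.
import ray

ray.init(address="ray://oururl.ourdomain.com:10001")

@ray.remote
def ping():
    return "ok"

print(ray.get(ping.remote()))  # round-trips through the head node proxy
ray.shutdown()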

The Ray Client is just a proxy to the head node's ray_client_server. That is, the Ray client server creates a driver for you and runs tasks for you from the head node.

So, I think the problem is that the head node raylet is dead, and the Ray client failed to connect to the driver, which raises an exception. gRPC returns an UNKNOWN status when the server raises an exception (which seems to be your case).
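
To illustrate the gRPC behavior with a standalone sketch (plain grpcio, no Ray protos involved): an unhandled exception inside a handler surfaces to the caller as StatusCode.UNKNOWN, which is consistent with what you are seeing:

# Standalone sketch: a server-side exception becomes StatusCode.UNKNOWN on the client.
from concurrent import futures
import grpc

def fail(request, context):
    raise RuntimeError("boom")  # unhandled exception in the handler

handler = grpc.method_handlers_generic_handler(
    "Demo",
    {"Fail": grpc.unary_unary_rpc_method_handler(
        fail,
        request_deserializer=lambda b: b,
        response_serializer=lambda b: b)})

server = grpc.server(futures.ThreadPoolExecutor(max_workers=1))
server.add_generic_rpc_handlers((handler,))
port = server.add_insecure_port("localhost:0")
server.start()

channel = grpc.insecure_channel(f"localhost:{port}")
call = channel.unary_unary("/Demo/Fail",
                           request_serializer=lambda b: b,
                           response_deserializer=lambda b: b)
try:
    call(b"")
except grpc.RpcError as e:
    print(e.code())  # StatusCode.UNKNOWN
server.stop(0)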

So the root cause must be that the raylet is dead, and it seems like it died because it couldn't send heartbeats. What does your workload look like? Also, while you are running tasks, do you observe high CPU usage from the raylet?

The head node only runs our ServeController and the HTTP proxy actors used for Ray Serve; our cluster sets rayResources: {"CPU": 0} on the head node pod type to avoid task scheduling there. All of our tasks are sent to actors that run on worker pods, and there can be 10-40 tasks running at once (ML inference).
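
As a rough illustration of that layout (the class name is made up for this sketch): because the actors request CPUs explicitly and the head pod advertises zero, they can only be scheduled onto worker pods:

# Illustrative sketch only; the actor class name is made up.
# With rayResources: {"CPU": 0} on the head pod, an actor requesting CPUs
# can only be placed on worker pods.
import ray

ray.init(address="ray://oururl.ourdomain.com:10001")  # same placeholder URL as above

@ray.remote(num_cpus=1)
class InferenceActor:
    def predict(self, batch):
        # run ML inference on the batch (details omitted)
        return batch

actors = [InferenceActor.remote() for _ in range(10)]  # 10-40 concurrent in practice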

Typically, the head node has relatively low CPU usage (20-30%).