Periodic _MultiThreadedRendezvous failure leaves cluster in damaged state

[Posting a description of a possible bug, per @sangcho's request in another thread]

We are running a large job on our Ray cluster; it completes successfully a few times and then fails with the error message at the bottom of this post. Once that happens, all subsequent runs produce the same error until the cluster is taken down and brought back up.
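
For context, each run connects to the cluster through the Ray client before submitting any work, roughly like the sketch below (simplified; the head address and port are illustrative, and the real script does much more):

import ray

# Connect to the head node's Ray client server (port 10001 by default)
# before submitting the job's tasks and actors. The address here is an
# example; the real address comes from our job configuration.
client_builder = ray.client("10.0.1.34:10001")
client_builder.connect()

# ... submit tasks / actors for the job here ...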

The error message is contradictory in that it first claims 10.0.1.34 is neither the Redis address nor one of the raylet addresses:

This node has an IP address of 10.0.1.34, and Ray expects this IP address to be either the Redis address or one of the Raylet addresses.

But then the error message goes on to explicitly list that same IP in both those groups:

Connected to Redis at 10.0.1.34:6379 and found raylets at ... 10.0.1.34 ...

After this happens, the autoscaler also appears to have stuck resources hanging around: the status output shows 47 CPUs still in use after the failure, even though no other jobs are running:

======== Autoscaler status: 2021-07-12 12:18:32.602594 ========
Node status
---------------------------------------------------------------
Healthy:
 43 local.cluster.node
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------

Usage:
 47.0/1952.0 CPU
 0.0/34.0 GPU
 0.0/3.0 accelerator_type:T
 0.00/2878.539 GiB memory
 3.11/1237.651 GiB object_store_memory

Demands:
 {'CPU': 1.0}: 1471+ pending tasks/actors

Does anyone have an idea what might be going on, or how to work around it?

Note that we are running a nightly snapshot from a few weeks ago that includes some other bug fixes we needed. Here is the wheel URL (which includes the commit): https://s3-us-west-2.amazonaws.com/ray-wheels/master/f5f23448fcab7c896e478e9e5c9804a73688ec41/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl
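
For what it's worth, here is a quick way to double-check which build a given environment is actually running; the output should match the wheel above (2.0.0.dev0 at commit f5f23448…):

import ray

# The installed Ray package exposes the version string and the commit
# the wheel was built from.
print(ray.__version__)
print(ray.__commit__)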

I'm not sure how to advise on reproduction given the complexity of the jobs; however, we can reproduce the failure consistently in our lab if anyone wants us to extract more debug information.

Here is the full error message:

Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
Queue filler thread failed to  join before timeout: 5
[2021-07-12 04:34:54,535 ERROR  ] Got Error from data channel -- shutting down:
Traceback (most recent call last):
  File "/ceph/var/users/focusvq/anaconda3/envs/ray2/lib/python3.7/site-packages/ray/util/client/dataclient.py", line 69, in _data_main
    for response in resp_stream:
  File "/ceph/var/users/focusvq/anaconda3/envs/ray2/lib/python3.7/site-packages/grpc/_channel.py", line 426, in __next__
    return self._next()
  File "/ceph/var/users/focusvq/anaconda3/envs/ray2/lib/python3.7/site-packages/grpc/_channel.py", line 826, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
  status = StatusCode.UNKNOWN
  details = "Exception iterating responses: <_MultiThreadedRendezvous of RPC that terminated with:
  status = StatusCode.UNKNOWN
  details = "Exception iterating responses: This node has an IP address of 10.0.1.34, and Ray expects this IP address to be either the Redis address or one of the Raylet addresses. Connected to Redis at 10.0.1.34:6379 and found raylets at 10.0.1.17, 10.0.1.103, 10.0.1.12, 10.0.1.33, 10.0.1.5, 10.0.1.2, 10.0.1.38, 10.0.1.15, 10.0.1.37, 10.0.1.11, 10.0.1.40, 10.0.1.24, 10.0.1.26, 10.0.1.3, 10.0.1.10, 10.0.1.6, 10.0.1.32, 10.0.1.36, 10.0.1.23, 10.0.1.21, 10.0.1.14, 10.0.1.29, 10.0.1.25, 10.0.1.104, 10.0.1.19, 10.0.1.35, 10.0.1.13, 10.0.1.101, 10.0.1.18, 10.0.1.20, 10.0.1.7, 10.0.1.1, 10.0.1.4, 10.0.1.30, 10.0.1.22, 10.0.1.16, 10.0.1.27, 10.0.1.9, 10.0.1.28, 10.0.1.8, 10.0.1.39, 10.0.1.31, 10.0.1.34 but none of these match this node's IP 10.0.1.34. Are any of these actually a different IP address for the same node?You might need to provide --node-ip-address to specify the IP address that the head should use when sending to this node."
  debug_error_string = "{"created":"@1626078894.533785785","description":"Error received from peer ipv6:[::1]:23008","file":"src/core/lib/surface/call.cc","file_line":1062,"grpc_message":"Exception iterating responses: This node has an IP address of 10.0.1.34, and Ray expects this IP address to be either the Redis address or one of the Raylet addresses. Connected to Redis at 10.0.1.34:6379 and found raylets at 10.0.1.17, 10.0.1.103, 10.0.1.12, 10.0.1.33, 10.0.1.5, 10.0.1.2, 10.0.1.38, 10.0.1.15, 10.0.1.37, 10.0.1.11, 10.0.1.40, 10.0.1.24, 10.0.1.26, 10.0.1.3, 10.0.1.10, 10.0.1.6, 10.0.1.32, 10.0.1.36, 10.0.1.23, 10.0.1.21, 10.0.1.14, 10.0.1.29, 10.0.1.25, 10.0.1.104, 10.0.1.19, 10.0.1.35, 10.0.1.13, 10.0.1.101, 10.0.1.18, 10.0.1.20, 10.0.1.7, 10.0.1.1, 10.0.1.4, 10.0.1.30, 10.0.1.22, 10.0.1.16, 10.0.1.27, 10.0.1.9, 10.0.1.28, 10.0.1.8, 10.0.1.39, 10.0.1.31, 10.0.1.34 but none of these match this node's IP 10.0.1.34. Are any of these actually a different IP address for the same node?You might need to provide --node-ip-address to specify the IP address that the head should use when sending to this node.","grpc_status":2}"
>"
  debug_error_string = "{"created":"@1626078894.535529655","description":"Error received from peer ipv4:10.0.1.34:10001","file":"src/core/lib/surface/call.cc","file_line":1066,"grpc_message":"Exception iterating responses: <_MultiThreadedRendezvous of RPC that terminated with:\n\tstatus = StatusCode.UNKNOWN\n\tdetails = "Exception iterating responses: This node has an IP address of 10.0.1.34, and Ray expects this IP address to be either the Redis address or one of the Raylet addresses. Connected to Redis at 10.0.1.34:6379 and found raylets at 10.0.1.17, 10.0.1.103, 10.0.1.12, 10.0.1.33, 10.0.1.5, 10.0.1.2, 10.0.1.38, 10.0.1.15, 10.0.1.37, 10.0.1.11, 10.0.1.40, 10.0.1.24, 10.0.1.26, 10.0.1.3, 10.0.1.10, 10.0.1.6, 10.0.1.32, 10.0.1.36, 10.0.1.23, 10.0.1.21, 10.0.1.14, 10.0.1.29, 10.0.1.25, 10.0.1.104, 10.0.1.19, 10.0.1.35, 10.0.1.13, 10.0.1.101, 10.0.1.18, 10.0.1.20, 10.0.1.7, 10.0.1.1, 10.0.1.4, 10.0.1.30, 10.0.1.22, 10.0.1.16, 10.0.1.27, 10.0.1.9, 10.0.1.28, 10.0.1.8, 10.0.1.39, 10.0.1.31, 10.0.1.34 but none of these match this node's IP 10.0.1.34. Are any of these actually a different IP address for the same node?You might need to provide --node-ip-address to specify the IP address that the head should use when sending to this node."\n\tdebug_error_string = "{"created":"@1626078894.533785785","description":"Error received from peer ipv6:[::1]:23008","file":"src/core/lib/surface/call.cc","file_line":1062,"grpc_message":"Exception iterating responses: This node has an IP address of 10.0.1.34, and Ray expects this IP address to be either the Redis address or one of the Raylet addresses. Connected to Redis at 10.0.1.34:6379 and found raylets at 10.0.1.17, 10.0.1.103, 10.0.1.12, 10.0.1.33, 10.0.1.5, 10.0.1.2, 10.0.1.38, 10.0.1.15, 10.0.1.37, 10.0.1.11, 10.0.1.40, 10.0.1.24, 10.0.1.26, 10.0.1.3, 10.0.1.10, 10.0.1.6, 10.0.1.32, 10.0.1.36, 10.0.1.23, 10.0.1.21, 10.0.1.14, 10.0.1.29, 10.0.1.25, 10.0.1.104, 10.0.1.19, 10.0.1.35, 10.0.1.13, 10.0.1.101, 10.0.1.18, 10.0.1.20, 10.0.1.7, 10.0.1.1, 10.0.1.4, 10.0.1.30, 10.0.1.22, 10.0.1.16, 10.0.1.27, 10.0.1.9, 10.0.1.28, 10.0.1.8, 10.0.1.39, 10.0.1.31, 10.0.1.34 but none of these match this node's IP 10.0.1.34. Are any of these actually a different IP address for the same node?You might need to provide --node-ip-address to specify the IP address that the head should use when sending to this node.","grpc_status":2}"\n>","grpc_status":2}"
>
Traceback (most recent call last):
  File "/ceph/var/elcano/archives/inputFiles/v311_live.5.build/mnesRanking.py", line 585, in <module>
    clientBuilder.connect()
  File "/ceph/var/users/focusvq/anaconda3/envs/ray2/lib/python3.7/site-packages/ray/client_builder.py", line 97, in connect
    self.address, job_config=self._job_config)
  File "/ceph/var/users/focusvq/anaconda3/envs/ray2/lib/python3.7/site-packages/ray/util/client_connect.py", line 36, in connect
    ignore_version=ignore_version)
  File "/ceph/var/users/focusvq/anaconda3/envs/ray2/lib/python3.7/site-packages/ray/util/client/__init__.py", line 83, in connect
    self.client_worker._server_init(job_config)
  File "/ceph/var/users/focusvq/anaconda3/envs/ray2/lib/python3.7/site-packages/ray/util/client/worker.py", line 525, in _server_init
    self._call_init(init_req)
  File "/ceph/var/users/focusvq/anaconda3/envs/ray2/lib/python3.7/site-packages/ray/util/client/worker.py", line 503, in _call_init
    init_resp = self.data_client.Init(init_request)
  File "/ceph/var/users/focusvq/anaconda3/envs/ray2/lib/python3.7/site-packages/ray/util/client/dataclient.py", line 151, in Init
    resp = self._blocking_send(datareq)
  File "/ceph/var/users/focusvq/anaconda3/envs/ray2/lib/python3.7/site-packages/ray/util/client/dataclient.py", line 126, in _blocking_send
    "Sending request failed because the data channel "
ConnectionError: Sending request failed because the data channel terminated. This is usually due to an error in handling the most recent request: req_id: 1
init {
  job_config: "\200\004\225\265\001\000\000\000\000\000\000\214\016ray.job_config\224\214\tJobConfig\224\223\224)\201\224}\224(\214\nworker_env\224}\224\214\021AUTOSCALER_EVENTS\224\214\0010\224s\214\034num_java_workers_per_process\224K\001\214\013jvm_options\224]\224\214\020code_search_path\224]\224\214\nclient_job\224\211\214\010metadata\224}\224\214\rray_namespace\224N\214\023_parsed_runtime_env\224\214\030ray._private.runtime_env\224\214\016RuntimeEnvDict\224\223\224)\201\224}\224\214\005_dict\224}\224(\214\013working_dir\224N\214\005conda\224\214\010dan-py37\224\214\003pip\224N\214\010env_vars\224}\224h\007h\010s\214\013_ray_commit\224\214(be6db0648546f51d16dba4ccc6a8ddcff6d10400\224usb\214\013runtime_env\224}\224(h\036h\037h\033h\034u\214\n_cached_pb\224Nub."
}

Hmm, it seems like the raylet at that IP address has a problem. Is it possible for you to get the raylet.out and raylet.err logs from /tmp/ray/session_latest/logs on 10.0.1.34?

Hi @sangcho, sorry for the delay. We had to retool our job submissions to avoid the issue and I’m stuck behind a big queue of production jobs. I’ll need a few more days to hopefully reproduce. I just wanted to let you know I’m not ignoring your request, thanks!
