[Posting a description of a possible bug per @sangcho's request from another thread]
We are running a big job on our Ray cluster that succeeds a few times and then fails with the error message at the bottom of this post. Once that happens, all subsequent runs produce the same error until the cluster is torn down and brought back up.
The error message is contradictory: it first claims 10.0.1.34 is neither the Redis address nor one of the Raylet addresses:
This node has an IP address of 10.0.1.34, and Ray expects this IP address to be either the Redis address or one of the Raylet addresses.
But then the error message goes on to explicitly list that same IP in both those groups:
Connected to Redis at 10.0.1.34:6379 and found raylets at ... 10.0.1.34 ...
After this happens, the autoscaler also appears to be left with stale resource usage: the status output shows 47 CPUs in use after the failure, even though no other jobs are running:
======== Autoscaler status: 2021-07-12 12:18:32.602594 ========
Node status
---------------------------------------------------------------
Healthy:
43 local.cluster.node
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
47.0/1952.0 CPU
0.0/34.0 GPU
0.0/3.0 accelerator_type:T
0.00/2878.539 GiB memory
3.11/1237.651 GiB object_store_memory
Demands:
{'CPU': 1.0}: 1471+ pending tasks/actors
Does anyone have an idea what might be going on, or how to work around it?
Note that we are running a nightly snapshot from a few weeks ago that includes some other bug fixes we needed. Here is the wheel URL/commit: https://s3-us-west-2.amazonaws.com/ray-wheels/master/f5f23448fcab7c896e478e9e5c9804a73688ec41/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl
We are not sure how to advise on reproduction given the complexity of the jobs; however, we can reproduce it consistently in our lab if anyone wants us to extract more debug information.
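For reference, the driver connects to the cluster through Ray Client; the connection step in mnesRanking.py boils down to roughly the following (a simplified sketch, with the head node IP and the Ray Client port taken from the traceback below):

import ray

# Simplified connection step. The real code also builds a job_config (conda env,
# worker env vars, etc.) as shown at the very end of the error output, then
# calls ClientBuilder.connect(), which is where the failure occurs.
ray.client("10.0.1.34:10001").connect()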
Here is the full error message:
Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
Queue filler thread failed to join before timeout: 5
[2021-07-12 04:34:54,535 ERROR ] Got Error from data channel -- shutting down:
Traceback (most recent call last):
File "/ceph/var/users/focusvq/anaconda3/envs/ray2/lib/python3.7/site-packages/ray/util/client/dataclient.py", line 69, in _data_main
for response in resp_stream:
File "/ceph/var/users/focusvq/anaconda3/envs/ray2/lib/python3.7/site-packages/grpc/_channel.py", line 426, in __next__
return self._next()
File "/ceph/var/users/focusvq/anaconda3/envs/ray2/lib/python3.7/site-packages/grpc/_channel.py", line 826, in _next
raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.UNKNOWN
details = "Exception iterating responses: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.UNKNOWN
details = "Exception iterating responses: This node has an IP address of 10.0.1.34, and Ray expects this IP address to be either the Redis address or one of the Raylet addresses. Connected to Redis at 10.0.1.34:6379 and found raylets at 10.0.1.17, 10.0.1.103, 10.0.1.12, 10.0.1.33, 10.0.1.5, 10.0.1.2, 10.0.1.38, 10.0.1.15, 10.0.1.37, 10.0.1.11, 10.0.1.40, 10.0.1.24, 10.0.1.26, 10.0.1.3, 10.0.1.10, 10.0.1.6, 10.0.1.32, 10.0.1.36, 10.0.1.23, 10.0.1.21, 10.0.1.14, 10.0.1.29, 10.0.1.25, 10.0.1.104, 10.0.1.19, 10.0.1.35, 10.0.1.13, 10.0.1.101, 10.0.1.18, 10.0.1.20, 10.0.1.7, 10.0.1.1, 10.0.1.4, 10.0.1.30, 10.0.1.22, 10.0.1.16, 10.0.1.27, 10.0.1.9, 10.0.1.28, 10.0.1.8, 10.0.1.39, 10.0.1.31, 10.0.1.34 but none of these match this node's IP 10.0.1.34. Are any of these actually a different IP address for the same node?You might need to provide --node-ip-address to specify the IP address that the head should use when sending to this node."
debug_error_string = "{"created":"@1626078894.533785785","description":"Error received from peer ipv6:[::1]:23008","file":"src/core/lib/surface/call.cc","file_line":1062,"grpc_message":"Exception iterating responses: This node has an IP address of 10.0.1.34, and Ray expects this IP address to be either the Redis address or one of the Raylet addresses. Connected to Redis at 10.0.1.34:6379 and found raylets at 10.0.1.17, 10.0.1.103, 10.0.1.12, 10.0.1.33, 10.0.1.5, 10.0.1.2, 10.0.1.38, 10.0.1.15, 10.0.1.37, 10.0.1.11, 10.0.1.40, 10.0.1.24, 10.0.1.26, 10.0.1.3, 10.0.1.10, 10.0.1.6, 10.0.1.32, 10.0.1.36, 10.0.1.23, 10.0.1.21, 10.0.1.14, 10.0.1.29, 10.0.1.25, 10.0.1.104, 10.0.1.19, 10.0.1.35, 10.0.1.13, 10.0.1.101, 10.0.1.18, 10.0.1.20, 10.0.1.7, 10.0.1.1, 10.0.1.4, 10.0.1.30, 10.0.1.22, 10.0.1.16, 10.0.1.27, 10.0.1.9, 10.0.1.28, 10.0.1.8, 10.0.1.39, 10.0.1.31, 10.0.1.34 but none of these match this node's IP 10.0.1.34. Are any of these actually a different IP address for the same node?You might need to provide --node-ip-address to specify the IP address that the head should use when sending to this node.","grpc_status":2}"
>"
debug_error_string = "{"created":"@1626078894.535529655","description":"Error received from peer ipv4:10.0.1.34:10001","file":"src/core/lib/surface/call.cc","file_line":1066,"grpc_message":"Exception iterating responses: <_MultiThreadedRendezvous of RPC that terminated with:\n\tstatus = StatusCode.UNKNOWN\n\tdetails = "Exception iterating responses: This node has an IP address of 10.0.1.34, and Ray expects this IP address to be either the Redis address or one of the Raylet addresses. Connected to Redis at 10.0.1.34:6379 and found raylets at 10.0.1.17, 10.0.1.103, 10.0.1.12, 10.0.1.33, 10.0.1.5, 10.0.1.2, 10.0.1.38, 10.0.1.15, 10.0.1.37, 10.0.1.11, 10.0.1.40, 10.0.1.24, 10.0.1.26, 10.0.1.3, 10.0.1.10, 10.0.1.6, 10.0.1.32, 10.0.1.36, 10.0.1.23, 10.0.1.21, 10.0.1.14, 10.0.1.29, 10.0.1.25, 10.0.1.104, 10.0.1.19, 10.0.1.35, 10.0.1.13, 10.0.1.101, 10.0.1.18, 10.0.1.20, 10.0.1.7, 10.0.1.1, 10.0.1.4, 10.0.1.30, 10.0.1.22, 10.0.1.16, 10.0.1.27, 10.0.1.9, 10.0.1.28, 10.0.1.8, 10.0.1.39, 10.0.1.31, 10.0.1.34 but none of these match this node's IP 10.0.1.34. Are any of these actually a different IP address for the same node?You might need to provide --node-ip-address to specify the IP address that the head should use when sending to this node."\n\tdebug_error_string = "{"created":"@1626078894.533785785","description":"Error received from peer ipv6:[::1]:23008","file":"src/core/lib/surface/call.cc","file_line":1062,"grpc_message":"Exception iterating responses: This node has an IP address of 10.0.1.34, and Ray expects this IP address to be either the Redis address or one of the Raylet addresses. Connected to Redis at 10.0.1.34:6379 and found raylets at 10.0.1.17, 10.0.1.103, 10.0.1.12, 10.0.1.33, 10.0.1.5, 10.0.1.2, 10.0.1.38, 10.0.1.15, 10.0.1.37, 10.0.1.11, 10.0.1.40, 10.0.1.24, 10.0.1.26, 10.0.1.3, 10.0.1.10, 10.0.1.6, 10.0.1.32, 10.0.1.36, 10.0.1.23, 10.0.1.21, 10.0.1.14, 10.0.1.29, 10.0.1.25, 10.0.1.104, 10.0.1.19, 10.0.1.35, 10.0.1.13, 10.0.1.101, 10.0.1.18, 10.0.1.20, 10.0.1.7, 10.0.1.1, 10.0.1.4, 10.0.1.30, 10.0.1.22, 10.0.1.16, 10.0.1.27, 10.0.1.9, 10.0.1.28, 10.0.1.8, 10.0.1.39, 10.0.1.31, 10.0.1.34 but none of these match this node's IP 10.0.1.34. Are any of these actually a different IP address for the same node?You might need to provide --node-ip-address to specify the IP address that the head should use when sending to this node.","grpc_status":2}"\n>","grpc_status":2}"
>
Traceback (most recent call last):
File "/ceph/var/elcano/archives/inputFiles/v311_live.5.build/mnesRanking.py", line 585, in <module>
clientBuilder.connect()
File "/ceph/var/users/focusvq/anaconda3/envs/ray2/lib/python3.7/site-packages/ray/client_builder.py", line 97, in connect
self.address, job_config=self._job_config)
File "/ceph/var/users/focusvq/anaconda3/envs/ray2/lib/python3.7/site-packages/ray/util/client_connect.py", line 36, in connect
ignore_version=ignore_version)
File "/ceph/var/users/focusvq/anaconda3/envs/ray2/lib/python3.7/site-packages/ray/util/client/__init__.py", line 83, in connect
self.client_worker._server_init(job_config)
File "/ceph/var/users/focusvq/anaconda3/envs/ray2/lib/python3.7/site-packages/ray/util/client/worker.py", line 525, in _server_init
self._call_init(init_req)
File "/ceph/var/users/focusvq/anaconda3/envs/ray2/lib/python3.7/site-packages/ray/util/client/worker.py", line 503, in _call_init
init_resp = self.data_client.Init(init_request)
File "/ceph/var/users/focusvq/anaconda3/envs/ray2/lib/python3.7/site-packages/ray/util/client/dataclient.py", line 151, in Init
resp = self._blocking_send(datareq)
File "/ceph/var/users/focusvq/anaconda3/envs/ray2/lib/python3.7/site-packages/ray/util/client/dataclient.py", line 126, in _blocking_send
"Sending request failed because the data channel "
ConnectionError: Sending request failed because the data channel terminated. This is usually due to an error in handling the most recent request: req_id: 1
init {
job_config: "\200\004\225\265\001\000\000\000\000\000\000\214\016ray.job_config\224\214\tJobConfig\224\223\224)\201\224}\224(\214\nworker_env\224}\224\214\021AUTOSCALER_EVENTS\224\214\0010\224s\214\034num_java_workers_per_process\224K\001\214\013jvm_options\224]\224\214\020code_search_path\224]\224\214\nclient_job\224\211\214\010metadata\224}\224\214\rray_namespace\224N\214\023_parsed_runtime_env\224\214\030ray._private.runtime_env\224\214\016RuntimeEnvDict\224\223\224)\201\224}\224\214\005_dict\224}\224(\214\013working_dir\224N\214\005conda\224\214\010dan-py37\224\214\003pip\224N\214\010env_vars\224}\224h\007h\010s\214\013_ray_commit\224\214(be6db0648546f51d16dba4ccc6a8ddcff6d10400\224usb\214\013runtime_env\224}\224(h\036h\037h\033h\034u\214\n_cached_pb\224Nub."
}