Can anyone tell me what the following error means:
2022-07-18 15:40:54,869 INFO proxier.py:642 -- New data connection from client 69f43bf09c4048899c51be76aecfe52c:
2022-07-18 15:40:55,920 INFO proxier.py:333 -- SpecificServer started on port: 23003 with PID: 4520 for client: 69f43bf09c4048899c51be76aecfe52c
2022-07-18 15:41:25,922 ERROR proxier.py:371 -- Timeout waiting for channel for 69f43bf09c4048899c51be76aecfe52c
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/ray/util/client/server/proxier.py", line 367, in get_channel
timeout=CHECK_CHANNEL_TIMEOUT_S
File "/usr/local/lib/python3.6/dist-packages/grpc/_utilities.py", line 140, in result
self._block(timeout)
File "/usr/local/lib/python3.6/dist-packages/grpc/_utilities.py", line 86, in _block
raise grpc.FutureTimeoutError()
grpc.FutureTimeoutError
2022-07-18 15:41:25,923 ERROR proxier.py:371 -- Timeout waiting for channel for 69f43bf09c4048899c51be76aecfe52c
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/ray/util/client/server/proxier.py", line 367, in get_channel
timeout=CHECK_CHANNEL_TIMEOUT_S
File "/usr/local/lib/python3.6/dist-packages/grpc/_utilities.py", line 140, in result
self._block(timeout)
File "/usr/local/lib/python3.6/dist-packages/grpc/_utilities.py", line 86, in _block
raise grpc.FutureTimeoutError()
grpc.FutureTimeoutError
2022-07-18 15:41:25,923 ERROR proxier.py:664 -- Channel not found for 69f43bf09c4048899c51be76aecfe52c
2022-07-18 15:41:25,923 WARNING proxier.py:749 -- Retrying Logstream connection. 1 attempts failed.
2022-07-18 15:41:39,982 INFO proxier.py:383 -- Specific server 69f43bf09c4048899c51be76aecfe52c is no longer running, freeing its port 23003
It happens only when I try to deploy a Ray cluster with a version greater than 1.8.
I manage to connect to the head node but fail to submit the remote function.
Deployment details:
new operator (KubeRay)
ray version 1.12.1
python 3.6.9
k8s cluster
Looks like this can happen if the head node fails to set up a gRPC server. Can you provide some more details on where the client is connecting from (outside the cluster? Through an ingress/load balancer?)
As a sanity check can you try running ray.init("ray://localhost:10001") directly on the head node, and see if you can submit a task from there?
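For reference, the sanity check above could look roughly like the sketch below. The helper names client_address and submit_ping are made up for illustration, and the address assumes the Ray Client server is listening on port 10001 of the head node, as in the suggestion above. It only uses ray.init, ray.remote, ray.get and ray.shutdown.

```python
def client_address(host="localhost", port=10001):
    """Build a Ray Client address such as ray://localhost:10001."""
    return "ray://{}:{}".format(host, port)


def submit_ping(address):
    """Connect through Ray Client, run one trivial task, and disconnect.

    Hypothetical helper for this thread, not part of Ray itself.
    """
    import ray  # imported lazily so the file still loads where Ray is absent

    ray.init(address)
    try:
        @ray.remote
        def ping():
            return "pong"

        # A timeout keeps the check from hanging forever if no task can run.
        return ray.get(ping.remote(), timeout=60)
    finally:
        ray.shutdown()
```

On the head node you would call submit_ping(client_address()) from a Python shell. If it hangs or raises, the ray_client_server.err log on the head is the place to look next.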
Hi @ckw017, thanks for replying.
I’m running the client from a client pod inside the k8s cluster. It is able to connect to the head pod but unable to submit a remote function.
I tried ray.init("ray://localhost:10001") from within the head pod, with the same result.
Here is the log from ray_client_server.err:
2022-07-26 17:36:38,393 INFO proxier.py:649 -- New data connection from client 7c272f6de4b74d8a8ddbce3cc4720304:
2022-07-26 17:36:39,426 INFO proxier.py:340 -- SpecificServer started on port: 23002 with PID: 654 for client: 7c272f6de4b74d8a8ddbce3cc4720304
2022-07-26 17:37:09,428 ERROR proxier.py:378 -- Timeout waiting for channel for 7c272f6de4b74d8a8ddbce3cc4720304
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/ray/util/client/server/proxier.py", line 374, in get_channel
timeout=CHECK_CHANNEL_TIMEOUT_S
File "/usr/local/lib/python3.6/dist-packages/grpc/_utilities.py", line 140, in result
self._block(timeout)
File "/usr/local/lib/python3.6/dist-packages/grpc/_utilities.py", line 86, in _block
raise grpc.FutureTimeoutError()
grpc.FutureTimeoutError
2022-07-26 17:37:09,428 ERROR proxier.py:378 -- Timeout waiting for channel for 7c272f6de4b74d8a8ddbce3cc4720304
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/ray/util/client/server/proxier.py", line 374, in get_channel
timeout=CHECK_CHANNEL_TIMEOUT_S
File "/usr/local/lib/python3.6/dist-packages/grpc/_utilities.py", line 140, in result
self._block(timeout)
File "/usr/local/lib/python3.6/dist-packages/grpc/_utilities.py", line 86, in _block
raise grpc.FutureTimeoutError()
grpc.FutureTimeoutError
2022-07-26 17:37:09,428 WARNING proxier.py:756 -- Retrying Logstream connection. 1 attempts failed.
2022-07-26 17:37:09,429 ERROR proxier.py:671 -- Channel not found for 7c272f6de4b74d8a8ddbce3cc4720304
2022-07-26 17:37:23,683 INFO proxier.py:390 -- Specific server 7c272f6de4b74d8a8ddbce3cc4720304 is no longer running, freeing its port 23002
2022-07-26 17:37:39,429 INFO proxier.py:722 -- 7c272f6de4b74d8a8ddbce3cc4720304 last started stream at 1658856998.3927054. Current stream started at 1658856998.3927054.
2022-07-26 17:37:41,431 ERROR proxier.py:378 -- Timeout waiting for channel for 7c272f6de4b74d8a8ddbce3cc4720304
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/ray/util/client/server/proxier.py", line 374, in get_channel
timeout=CHECK_CHANNEL_TIMEOUT_S
File "/usr/local/lib/python3.6/dist-packages/grpc/_utilities.py", line 140, in result
self._block(timeout)
File "/usr/local/lib/python3.6/dist-packages/grpc/_utilities.py", line 86, in _block
raise grpc.FutureTimeoutError()
grpc.FutureTimeoutError
2022-07-26 17:37:41,432 WARNING proxier.py:756 -- Retrying Logstream connection. 2 attempts failed.
2022-07-26 17:37:43,434 ERROR proxier.py:349 -- Unable to find channel for client: 7c272f6de4b74d8a8ddbce3cc4720304
2022-07-26 17:37:43,434 WARNING proxier.py:756 -- Retrying Logstream connection. 3 attempts failed.
2022-07-26 17:37:45,436 ERROR proxier.py:349 -- Unable to find channel for client: 7c272f6de4b74d8a8ddbce3cc4720304
2022-07-26 17:37:45,437 WARNING proxier.py:756 -- Retrying Logstream connection. 4 attempts failed.
2022-07-26 17:37:47,439 ERROR proxier.py:349 -- Unable to find channel for client: 7c272f6de4b74d8a8ddbce3cc4720304
2022-07-26 17:37:47,439 WARNING proxier.py:756 -- Retrying Logstream connection. 5 attempts failed.
I’m using the same Ray and Python versions in the client, head, and worker pods.
Any ideas how to resolve this issue?
k8s rev: v1.19.16.
I’m using my own internal image and installing the ray-x-x-x-py36 wheel on it (it works fine with Ray 1.6.0; my problem starts with versions higher than 1.10.0).
The image that you sent works as expected, and this is my problem: I’m trying to figure out where I have conflicts.
Is it at the level of a certain package, or in the Python version? Why does it work with Ray 1.6 but not with 1.13.0? (My internal image is built with Python 3.6.9.)
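One way to narrow down a package-level conflict is to print the installed versions in both images (the internal one and the one that works) and compare them. The sketch below is only a starting point: the choice of ray, grpcio and protobuf is a guess at which packages matter here, and installed_version and environment_report are made-up helper names. It falls back to pkg_resources because importlib.metadata is not available on Python 3.6.

```python
import sys

try:
    # Available on Python 3.8 and newer.
    from importlib.metadata import PackageNotFoundError, version

    def installed_version(name):
        """Return the installed version of a distribution, or None."""
        try:
            return version(name)
        except PackageNotFoundError:
            return None
except ImportError:
    # Fallback for older interpreters such as Python 3.6 (needs setuptools).
    import pkg_resources

    def installed_version(name):
        """Return the installed version of a distribution, or None."""
        try:
            return pkg_resources.get_distribution(name).version
        except pkg_resources.DistributionNotFound:
            return None


def environment_report(packages=("ray", "grpcio", "protobuf")):
    """Collect the Python version and the versions of the given packages."""
    report = {"python": sys.version.split()[0]}
    for name in packages:
        report[name] = installed_version(name)
    return report
```

Running print(environment_report()) in each image and diffing the two results shows which versions differ. A package reported as None is not installed in that image.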
Hmm, can you try doing this on the head node: python -m ray.util.client.server --port=10005 --mode specific-server and sharing the output? If something is failing there, it might be related to a dependency conflict.
So I ran python -m ray.util.client.server --port=10005 --mode specific-server and the output is:
2022-08-02 04:37:44,413 INFO server.py:843 -- Starting Ray Client server on 0.0.0.0:10005
2022-08-02 04:37:49,433 INFO server.py:890 -- 25 idle checks before shutdown.
2022-08-02 04:37:54,440 INFO server.py:890 -- 20 idle checks before shutdown.
2022-08-02 04:37:59,450 INFO server.py:890 -- 15 idle checks before shutdown.
2022-08-02 04:38:04,460 INFO server.py:890 -- 10 idle checks before shutdown.
2022-08-02 04:38:09,470 INFO server.py:890 -- 5 idle checks before shutdown.
It seems that nothing failed, but I still can’t execute ray.init("ray://localhost:10001") from the head.
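To tell "nothing is listening" apart from "something is listening but the session fails later", a plain TCP probe of the ports mentioned in this thread may help. This is a minimal sketch using only the standard library. port_is_open is a made-up helper name, and an open port only shows that a process accepts connections, not that the Ray Client handshake succeeds.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False
```

On the head you could call port_is_open("localhost", 10001) for the proxy and port_is_open("localhost", 10005) while the manually started server is still running. Probing the SpecificServer port from the log (for example 23002) during a connection attempt is another option.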