Unable to submit remote function - k8s cluster, Ray version 1.12.1

Hi Team,

Can anyone tell me what the following error means?
2022-07-18 15:40:54,869 INFO proxier.py:642 -- New data connection from client 69f43bf09c4048899c51be76aecfe52c:
2022-07-18 15:40:55,920 INFO proxier.py:333 -- SpecificServer started on port: 23003 with PID: 4520 for client: 69f43bf09c4048899c51be76aecfe52c
2022-07-18 15:41:25,922 ERROR proxier.py:371 -- Timeout waiting for channel for 69f43bf09c4048899c51be76aecfe52c
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/util/client/server/proxier.py", line 367, in get_channel
    timeout=CHECK_CHANNEL_TIMEOUT_S
  File "/usr/local/lib/python3.6/dist-packages/grpc/_utilities.py", line 140, in result
    self._block(timeout)
  File "/usr/local/lib/python3.6/dist-packages/grpc/_utilities.py", line 86, in _block
    raise grpc.FutureTimeoutError()
grpc.FutureTimeoutError
2022-07-18 15:41:25,923 ERROR proxier.py:371 -- Timeout waiting for channel for 69f43bf09c4048899c51be76aecfe52c
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/util/client/server/proxier.py", line 367, in get_channel
    timeout=CHECK_CHANNEL_TIMEOUT_S
  File "/usr/local/lib/python3.6/dist-packages/grpc/_utilities.py", line 140, in result
    self._block(timeout)
  File "/usr/local/lib/python3.6/dist-packages/grpc/_utilities.py", line 86, in _block
    raise grpc.FutureTimeoutError()
grpc.FutureTimeoutError
2022-07-18 15:41:25,923 ERROR proxier.py:664 -- Channel not found for 69f43bf09c4048899c51be76aecfe52c
2022-07-18 15:41:25,923 WARNING proxier.py:749 -- Retrying Logstream connection. 1 attempts failed.
2022-07-18 15:41:39,982 INFO proxier.py:383 -- Specific server 69f43bf09c4048899c51be76aecfe52c is no longer running, freeing its port 23003
This happens only when I deploy a Ray cluster with a version greater than 1.8.
I manage to connect to the head node, but I fail to submit the remote function.

Deployment details:
new operator (KubeRay)
Ray version 1.12.1
Python 3.6.9
k8s cluster

Thanks,

Looks like this can happen if the head node fails to set up a gRPC server. Can you provide some more details on where the client is connecting from? (Outside the cluster? Through an ingress/load balancer?)

As a sanity check, can you try running ray.init("ray://localhost:10001") directly on the head node and see if you can submit a task from there?

Hi @ckw017, thanks for replying.
I'm running the client from a client pod inside the k8s cluster; it is able to connect to the head pod but unable to submit a remote function.
I tried ray.init("ray://localhost:10001") within the head node, with the same result.
Here is the log from ray_client_server.err:

2022-07-26 17:36:38,393	INFO proxier.py:649 -- New data connection from client 7c272f6de4b74d8a8ddbce3cc4720304: 
2022-07-26 17:36:39,426	INFO proxier.py:340 -- SpecificServer started on port: 23002 with PID: 654 for client: 7c272f6de4b74d8a8ddbce3cc4720304
2022-07-26 17:37:09,428	ERROR proxier.py:378 -- Timeout waiting for channel for 7c272f6de4b74d8a8ddbce3cc4720304
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/util/client/server/proxier.py", line 374, in get_channel
    timeout=CHECK_CHANNEL_TIMEOUT_S
  File "/usr/local/lib/python3.6/dist-packages/grpc/_utilities.py", line 140, in result
    self._block(timeout)
  File "/usr/local/lib/python3.6/dist-packages/grpc/_utilities.py", line 86, in _block
    raise grpc.FutureTimeoutError()
grpc.FutureTimeoutError
2022-07-26 17:37:09,428	ERROR proxier.py:378 -- Timeout waiting for channel for 7c272f6de4b74d8a8ddbce3cc4720304
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/util/client/server/proxier.py", line 374, in get_channel
    timeout=CHECK_CHANNEL_TIMEOUT_S
  File "/usr/local/lib/python3.6/dist-packages/grpc/_utilities.py", line 140, in result
    self._block(timeout)
  File "/usr/local/lib/python3.6/dist-packages/grpc/_utilities.py", line 86, in _block
    raise grpc.FutureTimeoutError()
grpc.FutureTimeoutError
2022-07-26 17:37:09,428	WARNING proxier.py:756 -- Retrying Logstream connection. 1 attempts failed.
2022-07-26 17:37:09,429	ERROR proxier.py:671 -- Channel not found for 7c272f6de4b74d8a8ddbce3cc4720304
2022-07-26 17:37:23,683	INFO proxier.py:390 -- Specific server 7c272f6de4b74d8a8ddbce3cc4720304 is no longer running, freeing its port 23002
2022-07-26 17:37:39,429	INFO proxier.py:722 -- 7c272f6de4b74d8a8ddbce3cc4720304 last started stream at 1658856998.3927054. Current stream started at 1658856998.3927054.
2022-07-26 17:37:41,431	ERROR proxier.py:378 -- Timeout waiting for channel for 7c272f6de4b74d8a8ddbce3cc4720304
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/util/client/server/proxier.py", line 374, in get_channel
    timeout=CHECK_CHANNEL_TIMEOUT_S
  File "/usr/local/lib/python3.6/dist-packages/grpc/_utilities.py", line 140, in result
    self._block(timeout)
  File "/usr/local/lib/python3.6/dist-packages/grpc/_utilities.py", line 86, in _block
    raise grpc.FutureTimeoutError()
grpc.FutureTimeoutError
2022-07-26 17:37:41,432	WARNING proxier.py:756 -- Retrying Logstream connection. 2 attempts failed.
2022-07-26 17:37:43,434	ERROR proxier.py:349 -- Unable to find channel for client: 7c272f6de4b74d8a8ddbce3cc4720304
2022-07-26 17:37:43,434	WARNING proxier.py:756 -- Retrying Logstream connection. 3 attempts failed.
2022-07-26 17:37:45,436	ERROR proxier.py:349 -- Unable to find channel for client: 7c272f6de4b74d8a8ddbce3cc4720304
2022-07-26 17:37:45,437	WARNING proxier.py:756 -- Retrying Logstream connection. 4 attempts failed.
2022-07-26 17:37:47,439	ERROR proxier.py:349 -- Unable to find channel for client: 7c272f6de4b74d8a8ddbce3cc4720304
2022-07-26 17:37:47,439	WARNING proxier.py:756 -- Retrying Logstream connection. 5 attempts failed.

I'm using the same Ray and Python versions in the client, head, and worker pods.
Any ideas on how to resolve this issue?

Interesting, I'll see if I can repro on KubeRay. Just for reference, is this the image you're using?: Docker Hub

And what version of Kubernetes is your cluster running?

k8s rev: v1.19.16.
I'm using my own internal image and installing the ray-x-x-x-py36 wheel on it (it works fine with Ray 1.6.0; my problem starts with versions higher than 1.10.0).
The image you sent works as expected, and that is exactly my problem: I'm trying to figure out where I have conflicts.
Is it at the level of a certain package, or the Python version? Why does it work with Ray 1.6 but not with 1.13.0? (My internal image is built with Python 3.6.9.)
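One way to narrow this down is to dump the relevant versions in both the client pod and the head pod and diff them. A minimal sketch (the package list is just my guess at the usual suspects for a Ray Client gRPC timeout, not an exhaustive one):

```python
import importlib
import sys

def env_report(packages=("ray", "grpc", "google.protobuf")):
    """Collect versions that should match between the client and head pods.

    Ray Client generally requires the same Ray and Python versions on both
    ends, and a grpcio/protobuf conflict could also explain the timeout.
    """
    report = {"python": ".".join(map(str, sys.version_info[:3]))}
    for name in packages:
        try:
            mod = importlib.import_module(name)
            report[name] = getattr(mod, "__version__", "unknown")
        except ImportError:
            report[name] = "not installed"
    return report

print(env_report())
```

Running this in both pods and comparing the output should show whether the conflict is at the Python level or in a specific package.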

Hmm, can you try running this on the head node: python -m ray.util.client.server --port=10005 --mode specific-server and sharing the output? If something is failing there, it might be related to a dependency conflict.

So I ran python -m ray.util.client.server --port=10005 --mode specific-server and the output is:

2022-08-02 04:37:44,413 INFO server.py:843 -- Starting Ray Client server on 0.0.0.0:10005
2022-08-02 04:37:49,433 INFO server.py:890 -- 25 idle checks before shutdown.
2022-08-02 04:37:54,440 INFO server.py:890 -- 20 idle checks before shutdown.
2022-08-02 04:37:59,450 INFO server.py:890 -- 15 idle checks before shutdown.
2022-08-02 04:38:04,460 INFO server.py:890 -- 10 idle checks before shutdown.
2022-08-02 04:38:09,470 INFO server.py:890 -- 5 idle checks before shutdown.
It seems that nothing failed, but I still can't execute ray.init("ray://localhost:10001") from the head node.
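Since the specific-server starts cleanly on its own, it may also be worth ruling out plain TCP reachability of the client-server port from wherever ray.init is run. A throwaway check (the helper name is mine; port 10001 is the default Ray Client port used throughout this thread):

```python
import socket

def port_open(host="localhost", port=10001, timeout=3.0):
    """Return True if host:port accepts a TCP connection within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Run from the head node, then from the client pod (pointing at the head
# pod's service address), to see whether the port is reachable at all.
print(port_open())
```

If the port is reachable but ray.init still times out, that would point back at the proxier's internal gRPC channel rather than the network path.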