ConnectionError: ray client connection timeout, Ray 1.9.0, Kubernetes

I have deployed a Ray cluster on Kubernetes on my local machine and am trying to connect to it from another pod (running the business logic and models).

I am using ray.init("ray://example-cluster-ray-head:10001", namespace="ray")

and getting the stack trace below:

ray.init("ray://example-cluster-ray-head:10001", namespace="ray")
File "/usr/local/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
  return func(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/ray/worker.py", line 775, in init
  return builder.connect()
File "/usr/local/lib/python3.9/site-packages/ray/client_builder.py", line 151, in connect
  client_info_dict = ray.util.client_connect.connect(
File "/usr/local/lib/python3.9/site-packages/ray/util/client_connect.py", line 33, in connect
  conn = ray.connect(
File "/usr/local/lib/python3.9/site-packages/ray/util/client/__init__.py", line 228, in connect
  conn = self.get_context().connect(*args, **kw_args)
File "/usr/local/lib/python3.9/site-packages/ray/util/client/__init__.py", line 81, in connect
  self.client_worker = Worker(
File "/usr/local/lib/python3.9/site-packages/ray/util/client/worker.py", line 130, in __init__
  self._connect_channel()
File "/usr/local/lib/python3.9/site-packages/ray/util/client/worker.py", line 244, in _connect_channel
  raise ConnectionError("ray client connection timeout")
ConnectionError: ray client connection timeout
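The timeout means the client never established its gRPC channel to port 10001 on the head node. Before digging into Ray itself, a plain TCP check run from the client pod can confirm whether the head service is reachable at all — a minimal sketch, stdlib only (the host and port mirror the `ray.init()` address above):

```python
import socket

def tcp_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a plain TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # includes DNS failures (socket.gaierror) and refusals
        return False

# Run this from the client pod against the Ray head service.
print(tcp_reachable("example-cluster-ray-head", 10001))
```

If this prints False, the problem is networking (DNS, service, or port), not the Ray client itself.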

Hi @lihost, can you provide a full script for reproducing this issue? In addition, you might be able to reproduce it on your laptop by manually starting a Ray cluster with ray start --head and then calling ray.init("ray://127.0.0.1:10001", namespace="ray") from your script.

It looks similar to another P0 issue that we’re addressing now: [Bug] [Serve] Ray hangs on API methods · Issue #20971 · ray-project/ray · GitHub, where it can also get stuck and time out on the second iteration of init_ray() in the script.

Hi @jiaodong, thanks for your response.

I ran it manually again and it works fine with ray start --head, but when deploying it on Kubernetes, it throws this error.

I have followed the steps mentioned at Deploying on Kubernetes — Ray v2.0.0.dev0 to set up the Ray cluster within Kubernetes.

For running ray.server and ray.remote, I am following the above guide’s subsection, i.e. using-ray-client-to-connect-from-within-the-kubernetes-cluster.
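As a sanity check on the connection string used there, a small hypothetical helper (the name `check_ray_address` is my own, not from the guide) can parse the ray:// address and confirm the service name resolves from inside the client pod — a sketch, stdlib only:

```python
import socket
from urllib.parse import urlparse

def check_ray_address(address: str):
    """Parse a ray:// address and verify the host resolves via DNS.

    Raises ValueError for a bad scheme, and socket.gaierror if DNS
    resolution fails (a common cause of client connection timeouts).
    """
    parsed = urlparse(address)
    if parsed.scheme != "ray":
        raise ValueError(f"expected a ray:// address, got {address!r}")
    host = parsed.hostname
    port = parsed.port or 10001  # default Ray client server port
    socket.getaddrinfo(host, port)  # raises socket.gaierror on DNS failure
    return host, port
```

If DNS resolution fails here, ray.init() will sit in its retry loop and eventually raise the same "ray client connection timeout".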

I hope the docs haven’t changed.

I see — good to know. That should be a different category of issue, then. From your description it looks like you’re trying our latest documentation sample script and it is not working as expected. Is my understanding correct, or are you using a different script for your use case other than ray/job_example.py at master · ray-project/ray · GitHub?

cc: @architkulkarni

Yes, that’s pretty much what I am trying to do as a POC for now.

@lihost what output do you see when executing the steps in Deploying on Kubernetes — Ray v2.0.0.dev0, such as:

kubectl -n ray get rayclusters

kubectl -n ray get pods

kubectl -n ray get service

kubectl get deployment ray-operator

@Dmitri has the most context about our Ray-on-Kubernetes deployment.

Thanks @jiaodong, here is what I can see:

❯ kubectl -n ray get rayclusters
NAME              STATUS    RESTARTS   AGE
example-cluster   Running   0          42h


❯ kubectl -n ray get pods
NAME                                    READY   STATUS    RESTARTS   AGE
example-cluster-ray-head-type-pnkp2     1/1     Running   0          42h
example-cluster-ray-worker-type-5pwgv   1/1     Running   0          42h
example-cluster-ray-worker-type-x2l54   1/1     Running   0          42h


❯ kubectl -n ray get service
NAME                       TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                       AGE
example-cluster-ray-head   ClusterIP   xxx.xxx.xxx.xxx   <none>        10001/TCP,8265/TCP,8000/TCP   4m16s


❯ kubectl get deployment ray-operator
NAME           READY   UP-TO-DATE   AVAILABLE   AGE
ray-operator   1/1     1            1           42h

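One thing worth checking given the service output above: the bare name example-cluster-ray-head only resolves from a client pod inside the same Kubernetes namespace (ray). From any other namespace, the namespaced or fully qualified service DNS name is needed. A hypothetical helper (my own, following the standard Kubernetes service DNS convention) listing the candidate addresses to try:

```python
def candidate_addresses(service: str, namespace: str, port: int = 10001):
    """Build Ray client addresses from least to most qualified.

    Follows the standard Kubernetes service DNS convention:
    <service>.<namespace>.svc.cluster.local
    """
    return [
        f"ray://{service}:{port}",                                # same namespace only
        f"ray://{service}.{namespace}:{port}",                    # cross-namespace
        f"ray://{service}.{namespace}.svc.cluster.local:{port}",  # fully qualified
    ]

print(candidate_addresses("example-cluster-ray-head", "ray"))
```

Note that the namespace="ray" argument to ray.init() is a Ray namespace, which is unrelated to the Kubernetes namespace; they merely happen to share the same name here.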

@lihost Are you still experiencing this issue?

No longer seeing this issue.