Connecting to Ray cluster on Kubernetes from outside the cluster

I was trying to follow the documentation, Deploying on Kubernetes — Ray v2.0.0.dev0.
I created a Ray cluster on a local Kind cluster, using both the ray up and the operator approaches. The operator works great. ray up mostly works, but once in a while it does not process the namespace creation and consequently fails. This behavior seems to be random.
After the cluster is created, using Ray Client to connect from within the Kubernetes cluster works fine. I can look at the log and see the execution.
But when I try using Ray Client to connect from outside the Kubernetes cluster, the connection always fails. I see the message "Handling connection for 10001", but the result is:
Traceback (most recent call last):
  File "/Users/boris/Projects/RayOnKind/src/localPython.py", line 55, in <module>
    ray.util.connect(f"127.0.0.1:{LOCAL_PORT}")
  File "/usr/local/lib/python3.7/site-packages/ray/util/client_connect.py", line 26, in connect
    conn_str, secure=secure, metadata=metadata, connection_retries=3)
  File "/usr/local/lib/python3.7/site-packages/ray/util/client/__init__.py", line 57, in connect
    connection_retries=connection_retries)
  File "/usr/local/lib/python3.7/site-packages/ray/util/client/worker.py", line 120, in __init__
    raise ConnectionError("ray client connection timeout")
ConnectionError: ray client connection timeout
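
Since "Handling connection for 10001" shows up, the forwarded port itself is presumably reachable and the timeout happens later, during the Ray Client handshake. A minimal sketch to confirm that from Python before calling ray.util.connect, using only the standard library. The helper name port_is_open is hypothetical and not part of Ray; 10001 is the port from the post above.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a plain TCP connection to host:port can be established."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable)
        # when nothing is accepting connections on that address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example use (assumes the port-forward to 10001 is already running):
# if port_is_open("127.0.0.1", 10001):
#     print("TCP path is fine; the failure is at the Ray Client level")
# else:
#     print("nothing is listening on 127.0.0.1:10001")
```

If this returns True and the connect call still times out, the network path is probably not the culprit and it is worth looking at what the client and server are actually running.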

To make sure the Python versions match, I am using the image rayproject/ray:b3a717-py37-cpu,
which has the same Python version as my local one:

python3 --version
Python 3.7.10

When I run the same thing in a notebook, I get a bit more info:

2021-03-23 10:45:49,624 INFO worker.py:113 -- Waiting for Ray to become ready on the server, retry in 5s...
2021-03-23 10:45:59,633 INFO worker.py:113 -- Waiting for Ray to become ready on the server, retry in 10s...
2021-03-23 10:46:14,642 INFO worker.py:113 -- Waiting for Ray to become ready on the server, retry in 15s...

ConnectionError Traceback (most recent call last)
in
1 #ray.init(ignore_reinit_error=True)
2 #ray.init(address='127.0.0.1:10001', _redis_password='5241590000000000')
----> 3 ray.util.connect('127.0.0.1:10001')

/usr/local/lib/python3.7/site-packages/ray/util/client_connect.py in connect(conn_str, secure, metadata, connection_retries)
24 # the correct metadata
25 return ray.connect(
---> 26 conn_str, secure=secure, metadata=metadata, connection_retries=3)
27
28

/usr/local/lib/python3.7/site-packages/ray/util/client/__init__.py in connect(self, conn_str, secure, metadata, connection_retries)
55 secure=secure,
56 metadata=metadata,
---> 57 connection_retries=connection_retries)
58 self.api.worker = self.client_worker
59 return self.client_worker.connection_info()

/usr/local/lib/python3.7/site-packages/ray/util/client/worker.py in __init__(self, conn_str, secure, metadata, connection_retries)
118 # up our retries and should error back to the user.
119 if not ray_ready:
--> 120 raise ConnectionError("ray client connection timeout")
121
122 # Initialize the streams to finish protocol negotiation.

ConnectionError: ray client connection timeout

Hey @Dmitri and @Ameer_Haj_Ali, could you take a look at this?

Could it be because pip install ray installs Ray 1.2, while the cluster is using a much newer version?

And after reinstalling Ray to v2, the problem is fixed.
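
For anyone hitting the same timeout, here is a minimal sketch of a pre-flight check that compares the locally installed Ray version with the one the cluster is expected to run. The helper names and the EXPECTED_CLUSTER_VERSION constant are hypothetical. The value "2.0.0.dev0" is an assumption taken from the docs title in this thread, so substitute whatever your cluster image actually reports.

```python
from typing import Optional

# Assumption: the cluster image reports this version string.
EXPECTED_CLUSTER_VERSION = "2.0.0.dev0"


def local_ray_version() -> Optional[str]:
    """Return the locally installed Ray version, or None if Ray is missing."""
    try:
        import ray
    except ImportError:
        return None
    return ray.__version__


def version_mismatch(local: Optional[str], expected: str) -> Optional[str]:
    """Return a warning string if the versions differ, otherwise None."""
    if local is None:
        return "Ray is not installed locally"
    if local != expected:
        return f"local Ray {local} != cluster Ray {expected}"
    return None


# Example use before calling ray.util.connect(...):
# problem = version_mismatch(local_ray_version(), EXPECTED_CLUSTER_VERSION)
# if problem:
#     print(problem)
```

This only compares version strings. Two development builds can report the same version and still come from different commits, so a match here is a necessary check rather than a guarantee.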
