Access a Ray cluster (KubeRay) interactively using `ray.init("ray://...")`

I am able to connect to my remote KubeRay cluster through the Ray Job API, and network connectivity to the head pod looks fine. However, I cannot connect to the Ray cluster using `ray.init(address='ray://raycluster-default-head-svc.ray-system.svc.cluster.local:10001')` or `ray.init(address=f'ray://{HEAD_POD_IP}:10001')` from a JupyterHub notebook running in the same k8s cluster (in a different namespace).

I have seen examples in the docs and other posts suggesting this should work, but I am not sure what I am missing.

To recap the setup: the Ray cluster runs in the ray-system namespace, and JupyterHub runs in another namespace of the same k8s cluster. The call that fails is `ray.init("ray://raycluster-default-head-svc.ray-system.svc.cluster.local:10001")`.

However, this works from a terminal on that notebook server:

ray job submit --address http://raycluster-default-head-svc.ray-system.svc.cluster.local:8265 -- python -c "import ray; ray.init('auto'); print(ray.cluster_resources())"

So I can reach the head service over HTTP (port 8265), but connecting with the ray:// protocol (port 10001) does not work.
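For reference, the working HTTP path (port 8265) can also be exercised from Python via the Job Submission SDK. This is a minimal sketch using the same service address as above, not something I ran verbatim:

from ray.job_submission import JobSubmissionClient

# Job submission talks HTTP to the dashboard port (8265), not ray:// on 10001.
client = JobSubmissionClient("http://raycluster-default-head-svc.ray-system.svc.cluster.local:8265")
job_id = client.submit_job(
    entrypoint="python -c \"import ray; ray.init('auto'); print(ray.cluster_resources())\""
)
print(job_id, client.get_job_status(job_id))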

I also tried using the head pod's IP (instead of the kube-dns service name), but the behavior is the same.

Also, where can I read more about the ray:// protocol? A little more understanding of the protocol would help when debugging scenarios like this. From the stack trace on connection failure I see `def _can_reconnect(self, e: grpc.RpcError) -> bool:`, which makes me think this is gRPC, but I don't see anything documented about the protocol, so I am not 100% sure.
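Since the stack trace points at gRPC, one way to sanity-check port 10001 directly is to wait for a gRPC channel to become ready. This is just a sketch (assuming grpcio is installed in the notebook environment); unlike a bare TCP connect, it only succeeds if an actual gRPC-level connection can be established:

import grpc

addr = "raycluster-default-head-svc.ray-system.svc.cluster.local:10001"
channel = grpc.insecure_channel(addr)
try:
    # Blocks until the channel reaches READY (HTTP/2 connection established) or the timeout expires.
    grpc.channel_ready_future(channel).result(timeout=10)
    print("gRPC channel is ready")
except grpc.FutureTimeoutError:
    print("gRPC channel did not become ready within 10s")
finally:
    channel.close()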

Update: I had Ray v2.3.0 on the client and a Ray v2.2.0 image on the cluster. I changed the client version to v2.2.0 to match the cluster, but I still get the same error.

I also confirmed network connectivity to the head pod and the required ports using the head pod's IP:

$ export RAY_CLUSTER_NS="ray-system"
$ export head_pod=$(kubectl -n $RAY_CLUSTER_NS get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
$ echo $head_pod
raycluster-default-head-6gxzc

$ export head_pod_ip=$(kubectl -n $RAY_CLUSTER_NS get pod $head_pod --template '{{.status.podIP}}')
$ echo $head_pod_ip
10.184.16.172

$ curl -s -I $head_pod_ip:8265
HTTP/1.1 200 OK
Content-Type: text/html
Etag: "172ea4bfeba1f000-1be"
Last-Modified: Wed, 07 Dec 2022 22:54:16 GMT
Content-Length: 446
Accept-Ranges: bytes
Date: Sat, 25 Feb 2023 17:54:49 GMT
Server: Python/3.8 aiohttp/3.8.3

$ curl telnet://$head_pod_ip:8265 -v
*   Trying 10.184.16.172:8265...
* Connected to 10.184.16.172 (10.184.16.172) port 8265 (#0)

^C

$ curl telnet://$head_pod_ip:10001 -v
*   Trying 10.184.16.172:10001...
* Connected to 10.184.16.172 (10.184.16.172) port 10001 (#0)

* Closing connection 0

I was finally able to solve this; putting the solution here in case someone else runs into it in the future.

TL;DR: If you are using an Istio mesh with STRICT mTLS, you need to DISABLE it for the Ray service ports 8265 and 10001.

In my case I had already disabled mTLS for port 8265, which is why I was able to reach the Ray cluster from the notebook terminal (through curl). The part that confused me was that I could telnet to port 10001 from the terminal, which made me think that port was reachable without disabling mTLS. In hindsight, that was the wrong test and the wrong conclusion. Once I disabled mTLS for both ports, everything worked fine.
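To illustrate why telnet was misleading: a plain TCP connect can still succeed (presumably the sidecar completes the handshake) even though the actual Ray Client traffic is blocked by mTLS. A quick sketch of that distinction, using the head pod IP from above:

import socket

# A bare TCP connect to 10001 succeeds even with STRICT mTLS in place.
# Compare with the gRPC readiness check above, which fails until mTLS is disabled for this port.
with socket.create_connection(("10.184.16.172", 10001), timeout=5):
    print("TCP connect succeeded (this alone does not prove the Ray Client can use the port)")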

Here is what my PeerAuthentication definition looks like:

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: raycluster-default-head-peer-auth
  namespace: ray-system
spec:
  selector:
    matchLabels:
      ray.io/node-type: head
  mtls:
    mode: UNSET
  portLevelMtls:
    8265:
      mode: DISABLE
    10001:
      mode: DISABLE
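
After applying the PeerAuthentication above, a quick check from the notebook confirms the Ray Client connection works (a minimal sketch, same service name as before):

import ray

# With mTLS disabled on 8265 and 10001, the ray:// connection now succeeds.
ray.init(address="ray://raycluster-default-head-svc.ray-system.svc.cluster.local:10001")
print(ray.cluster_resources())
ray.shutdown()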

Hope this saves someone some time. These things are not easy to debug.
