I can connect to my remote KubeRay cluster through the Ray Jobs API, and network connectivity to the head pod looks fine. However, I cannot connect to the Ray cluster with ray.init(address='ray://raycluster-default-head-svc.ray-system.svc.cluster.local:10001') or ray.init(address=f'ray://{HEAD_POD_IP}:10001') from a JupyterHub notebook running in the same k8s cluster (in a different namespace).
I have seen examples in docs and other posts mentioning this should work, but I am not sure what I am missing.
I have a Ray cluster running in the ray-system namespace and JupyterHub running in another namespace of the same k8s cluster. From a notebook, I cannot connect to the Ray cluster using ray.init("ray://raycluster-default-head-svc.ray-system.svc.cluster.local:10001").
However, this works from a terminal on that notebook server.
Also, where can I read more about the ray:// protocol? A better understanding of it might help debug scenarios like this. In the stack trace I get on connection failure I see def _can_reconnect(self, e: grpc.RpcError) -> bool:, which makes me think this is gRPC. But I don't see the protocol documented anywhere, so I am not 100% sure.
Update: I had Ray v2.3.0 on the client and a Ray v2.2.0 image on the cluster. I changed the client version to v2.2.0 to match the cluster, but I still get the same error.
I was finally able to solve this. I am putting the solution here in case someone else runs into it in the future.
TL;DR: If you are using an Istio mesh with STRICT mTLS, you need to DISABLE it for the Ray service ports 8265 and 10001.
In my case I had disabled mTLS for port 8265 only, which is why I was able to reach the Ray cluster from the notebook terminal (through curl). The part that confused me was that I could telnet to port 10001 from the terminal, which made me think I could reach that port without disabling mTLS. In hindsight, that was the wrong test and the wrong conclusion: telnet only shows that a TCP connection can be opened, which presumably still succeeds against the Istio sidecar even when the plaintext gRPC traffic is rejected afterwards. Once I disabled mTLS for both ports, everything worked fine.
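For anyone who wants a more telling check than telnet, here is a minimal sketch that compares a bare TCP connect with an attempt to bring up a plaintext gRPC channel. The helper names tcp_reachable and grpc_ready are made up for illustration, and it assumes the grpcio package is importable in the notebook environment (the Ray client stack trace above suggests it is). I have not verified exactly how the channel behaves against an Istio sidecar in STRICT mode, so treat a False from grpc_ready as a hint rather than proof.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP handshake completes (roughly what telnet checks)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def grpc_ready(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a plaintext gRPC channel reaches READY within the timeout."""
    import grpc  # assumption: grpcio is installed alongside the Ray client

    channel = grpc.insecure_channel(f"{host}:{port}")
    try:
        # Blocks until the channel is connected, or raises on timeout.
        grpc.channel_ready_future(channel).result(timeout=timeout)
        return True
    except grpc.FutureTimeoutError:
        return False
    finally:
        channel.close()
```

Calling both helpers for raycluster-default-head-svc.ray-system.svc.cluster.local on port 10001 from the notebook should show the difference: if tcp_reachable returns True while grpc_ready returns False, something between the client and the head pod (for example a mesh mTLS policy) is likely accepting the connection but not the plaintext gRPC traffic.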
Here is what my PeerAuthentication definition looks like