How severely does this issue affect your experience of using Ray?
High: It blocks me from completing my task.
Hi, I followed the "Deploying on Kubernetes" guide (Ray 1.12.1) to deploy on Kubernetes. Every step completes without errors, but when I run kubectl -n ray get pods, only the head node shows up in "Running" status. I am using an unmodified version of the config YAML file. I also repeated the commands on that page from top to bottom to make sure I had not missed a step. Any ideas?
I cannot share all of the logs, but the only error I see is:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
The resources on all of the nodes are more than enough to run Ray (RTX Titan / modern Xeon processor).
The network is heavily firewalled so I am unsure if that may be the issue (blocked ports?)
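One way to test the blocked-ports theory is to try a plain TCP connection from a worker node (or pod) to the head. The sketch below uses only the Python standard library; the helper name port_open is made up for illustration, and the ports mentioned in the comments (6379 for the head, 10001 for the Ray client server, 8265 for the dashboard) are the usual Ray defaults, which is an assumption about this particular deployment.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; it only checks TCP reachability,
    not whether a Ray process is actually answering on the other side.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example usage (the host is a placeholder; substitute your head pod or service IP):
#   for port in (6379, 10001, 8265):   # assumed Ray default ports
#       print(port, port_open("HEAD_IP_OR_NAME", port))
```

If a port reports False from a worker but True from the head node itself, a firewall or network policy between the nodes is a likely suspect.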
So this did work, but only for the static non-autoscaling version of Ray. I have it up and running without Ray operator. Any recommendations for the Ray operator version?
Also side question:
Should the number of replicas be equal to the number of nodes?
I ask because the replicas launch on the same node unless there is a large number of replicas (e.g., if I set them to 70, they spread out, but unevenly).
I have switched to KubeRay, following recommendations in similar GitHub issues. Now, with KubeRay, everything deploys, but the worker seems to be stuck in the PodInitializing stage with no errors. Any ideas?
Yes, so this led me down a rabbit hole yesterday which has given me a lot more success! But, now I have another issue and I’m unsure how to move forward.
It turns out there was a CoreDNS issue with the Kubernetes cluster. It was unable to resolve nameservers, so busybox would loop indefinitely. I fixed this issue, and nameservers can now be resolved. Now, when the Ray container spins up, it attempts to connect to raycluster-complete-head-svc:6379. Somehow, it is unable to resolve the underlying IP address, and CoreDNS server errors pop up, like:
[ERROR] plugin/errors: 2 raycluster-complete-head-svc. A: read udp → i/o timeout.
I’ve tried troubleshooting this using advice I found online, but nothing has worked.
One thing I noticed is that if I hard-code the IP address of the service in the ray start command, the pods spin up and actually connect to the head. So the issue is most definitely related to CoreDNS and how the ray start command resolves the service name.
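To narrow down where name resolution fails, it can help to try the short service name and its longer forms from inside a worker pod. This is a rough sketch using only the standard library: the helper names are made up for illustration, the namespace "ray" is an assumption (use whichever namespace the RayCluster lives in), and "cluster.local" is the common default cluster domain, which may differ on your cluster.

```python
import socket


def candidate_names(service: str, namespace: str,
                    cluster_domain: str = "cluster.local") -> list:
    """Return the short, namespaced, and fully qualified forms of a Service name."""
    return [
        service,
        f"{service}.{namespace}",
        f"{service}.{namespace}.svc.{cluster_domain}",
    ]


def resolve(name: str, port: int = 6379) -> list:
    """Return the sorted IP addresses that name resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(name, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []
    # Each entry's sockaddr starts with the IP address string.
    return sorted({info[4][0] for info in infos})


# Example usage from inside a worker pod (namespace is an assumption):
#   for name in candidate_names("raycluster-complete-head-svc", "ray"):
#       print(name, resolve(name))
```

If only the fully qualified form resolves, the pod's DNS search path is worth checking; if none of the forms resolve while the hard-coded IP works, the problem is more likely between the pod and CoreDNS itself (for example, blocked UDP port 53).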
Glad you were able to work this out. All the errors you’ve described are consistent with firewalls in your K8s cluster. We’ll make a note to clearly document what kinds of network communication the Ray-on-K8s components perform.