How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hi, I followed the guide on Deploying on Kubernetes — Ray 1.12.1 to deploy on Kubernetes. Everything went well with no issues, but when executing the command
kubectl -n ray get pods, only the head node shows “Running” status. I am using an unmodified version of the config YAML file, and I replicated the commands from top to bottom on that page to make sure the issue is not with my steps. Any ideas?
Hey, can you confirm that your Kubernetes cluster has enough available resources to run the Ray cluster?
Could you also share the logs from your operator? (
kubectl -n ray logs ray-operator)
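For reference, a few generic kubectl checks that usually surface the cause of pending worker pods (this assumes the operator pod is named ray-operator in the ray namespace, as above; adjust names to your setup):

```shell
# Compare allocatable vs. requested resources on each node
kubectl describe nodes | grep -A 6 "Allocated resources"

# Recent events in the ray namespace (scheduling failures show up here)
kubectl -n ray get events --sort-by=.metadata.creationTimestamp

# Tail the operator logs
kubectl -n ray logs ray-operator --tail=100
```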
I cannot share all of the logs, but the only error I see is:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
The resources on all of the nodes are more than enough to run Ray (RTX Titan GPUs, modern Xeon processors).
The network is heavily firewalled, so I am unsure if that may be the issue (blocked ports?).
Everything from the install page shows “Running”.
That sounds like a likely cause. Have you had a chance to check out the port configuration page?
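For anyone debugging this in a firewalled environment, a quick connectivity check from inside a pod in the same namespace can rule ports in or out. The port numbers below are Ray's commonly documented defaults (verify them against the port configuration page for your version), and the service name is a placeholder:

```shell
# HEAD_SVC is a hypothetical name for illustration -- substitute the actual
# Kubernetes service for your Ray head node.
HEAD_SVC=example-head-svc

# Default ports: 6379 (GCS, the address `ray start` connects to),
# 8265 (dashboard), 10001 (Ray client server).
for port in 6379 8265 10001; do
  nc -zv "$HEAD_SVC" "$port"
done
```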
So this did work, but only for the static, non-autoscaling version of Ray. I have it up and running without the Ray operator. Any recommendations for the Ray operator version?
Also, a side question:
Should the number of replicas be equal to the number of nodes?
I ask because the replicas launch on the same node unless there is a large number of replicas (e.g., set them to 70 and they disperse unevenly).
What happens with the version using the Ray operator? Are there any error messages?
No errors on that.
I have switched to KubeRay, per some recommendations in similar GitHub issues. Now, with KubeRay, everything deploys, but the worker seems to be stuck in the PodInitializing stage with no errors. Any ideas?
Could you run kubectl describe pod on the pod that is stuck? The events there should provide some clues.
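A minimal sketch of that check (the angle-bracket names are placeholders for your actual pod and init-container names):

```shell
# List pods to get the exact name of the stuck worker
kubectl -n ray get pods

# Describe it and look at the Events section at the bottom
kubectl -n ray describe pod <worker-pod-name>

# Init-container logs often explain a pod stuck in Init/PodInitializing
kubectl -n ray logs <worker-pod-name> -c <init-container-name>
```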
Yes, so this led me down a rabbit hole yesterday which has given me a lot more success! But, now I have another issue and I’m unsure how to move forward.
Turns out there was a CoreDNS issue with the Kubernetes cluster. It was unable to resolve nameservers, so busybox would loop infinitely. I fixed that issue, and nameservers can now be resolved. Now, when the Ray container spins up, it attempts to connect to raycluster-complete-head-svc:6379. Somehow it is unable to resolve the underlying IP address, and CoreDNS errors pop up like
[ERROR] plugin/errors: 2 raycluster-complete-head-svc. A: read udp → i/o timeout.
I’ve tried troubleshooting this online, but nothing has worked.
One thing I noticed is that if I hard-code the IP address of the service in the ray start command, the pods spin up and actually connect to the head. So the issue is almost certainly related to CoreDNS resolution of the service name used in the ray start command.
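For anyone hitting the same symptom, one way to confirm it is DNS rather than Ray itself is the standard busybox resolution test (a sketch; busybox:1.28 is commonly used here because nslookup in newer busybox images is unreliable):

```shell
# Run a throwaway busybox pod and try to resolve the head service
kubectl -n ray run dns-test --rm -it --image=busybox:1.28 --restart=Never \
  -- nslookup raycluster-complete-head-svc

# Also check the CoreDNS pods themselves for errors
kubectl -n kube-system logs -l k8s-app=kube-dns
```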
So I have solved the issue: running
firewall-cmd --add-masquerade --permanent
and then restarting the firewall (e.g., firewall-cmd --reload) fixed it.
Ray now works as intended.
Glad you were able to work this out – all the errors you’ve described are consistent with firewall restrictions in your K8s cluster. We’ll make sure to clearly document what kind of network communication the Ray-on-K8s components perform.