How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hi, I followed the guide on Deploying on Kubernetes — Ray 1.12.1 to deploy on Kubernetes. Everything went well with no issues, but when executing the command
kubectl -n ray get pods, only the head node shows “Running” status. I am using an unmodified version of the config YAML file, and I replicated the commands from top to bottom on that page to make sure the issue is not with my steps. Any ideas?
Hey, can you confirm that your Kubernetes cluster has enough available resources to run the Ray cluster?
Could you also share the logs from your operator? (
kubectl -n ray logs ray-operator)
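For reference, a few generic kubectl checks that usually surface the cause of pending worker pods (this assumes the operator pod is named ray-operator in the ray namespace, as above; adjust names to your setup):

```shell
# Compare allocatable vs. requested resources on each node
kubectl describe nodes | grep -A 6 "Allocated resources"

# Recent events in the ray namespace (scheduling failures show up here)
kubectl -n ray get events --sort-by=.metadata.creationTimestamp

# Tail the operator logs
kubectl -n ray logs ray-operator --tail=100
```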
I cannot share all of the logs, but the only error I see is:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
The resources on all of the nodes are more than enough to run Ray (RTX Titan GPUs, modern Xeon processors).
The network is heavily firewalled, so I am unsure if that may be the issue (blocked ports?).
Everything from the install page shows “Running”.
That sounds like a likely cause. Have you had a chance to check out the port configuration page?
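For anyone debugging this in a firewalled environment, a quick connectivity check from inside a pod in the same namespace can rule ports in or out. The port numbers below are Ray's commonly documented defaults (verify them against the port configuration page for your version), and the service name is a placeholder:

```shell
# HEAD_SVC is a hypothetical name for illustration -- substitute the actual
# Kubernetes service for your Ray head node.
HEAD_SVC=example-head-svc

# Default ports: 6379 (GCS, the address `ray start` connects to),
# 8265 (dashboard), 10001 (Ray client server).
for port in 6379 8265 10001; do
  nc -zv "$HEAD_SVC" "$port"
done
```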
So this did work, but only for the static, non-autoscaling version of Ray. I have it up and running without the Ray operator. Any recommendations for the Ray operator version?
Also, a side question:
Should the number of replicas be equal to the number of nodes?
I ask because the replicas launch on the same node unless there is a large number of replicas (e.g., set them to 70 and they disperse unevenly).
What happens with the version using the Ray operator? Are there any error messages?
No errors on that.
I have switched to KubeRay, per some recommendations in similar GitHub issues. Now, with KubeRay, everything deploys, but the worker seems to be stuck in the PodInitializing stage with no errors. Any ideas?
Could you run kubectl describe pod on the pod that is stuck? The events there should provide some clues.
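A minimal sketch of that check (the angle-bracket names are placeholders for your actual pod and init-container names):

```shell
# List pods to get the exact name of the stuck worker
kubectl -n ray get pods

# Describe it and look at the Events section at the bottom
kubectl -n ray describe pod <worker-pod-name>

# Init-container logs often explain a pod stuck in Init/PodInitializing
kubectl -n ray logs <worker-pod-name> -c <init-container-name>
```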
Yes, so this led me down a rabbit hole yesterday which has given me a lot more success! But, now I have another issue and I’m unsure how to move forward.
Turns out there was a CoreDNS issue with the Kubernetes cluster. It was unable to resolve nameservers, so busybox would loop infinitely. I fixed that issue, and nameservers can now be resolved. Now, when the Ray container spins up, it attempts to connect to raycluster-complete-head-svc:6379. Somehow it is unable to resolve the underlying IP address, and CoreDNS errors pop up like
[ERROR] plugin/errors: 2 raycluster-complete-head-svc. A: read udp → i/o timeout.
I’ve tried troubleshooting this online, but nothing has worked.
One thing I noticed is that if I hard-code the IP address of the service in the ray start command, the pods spin up and actually connect to the head. So the issue is almost certainly related to CoreDNS resolution of the service name used in the ray start command.
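For anyone hitting the same symptom, one way to confirm it is DNS rather than Ray itself is the standard busybox resolution test (a sketch; busybox:1.28 is commonly used here because nslookup in newer busybox images is unreliable):

```shell
# Run a throwaway busybox pod and try to resolve the head service
kubectl -n ray run dns-test --rm -it --image=busybox:1.28 --restart=Never \
  -- nslookup raycluster-complete-head-svc

# Also check the CoreDNS pods themselves for errors
kubectl -n kube-system logs -l k8s-app=kube-dns
```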
So I have solved the issue: running
firewall-cmd --add-masquerade --permanent
and then restarting the firewall (e.g., firewall-cmd --reload) fixed it.
Ray now works as intended.
Glad you were able to work this out – all the errors you’ve described are consistent with firewall restrictions in your K8s cluster. We’ll make sure to clearly document what kind of network communication the Ray-on-K8s components perform.