Here is what I did and the issues I found. At the end, it looks like the autoscaler is not being properly configured by the operator. Any guidance on how to solve this would be appreciated.
List pods
kubectl -n ray get pods
NAME READY STATUS RESTARTS AGE
ray-operator-pod 1/1 Running 0 3h58m
There are no worker/master pods listed
List the logs
kubectl -n ray logs ray-operator-pod | grep ^example-cluster:
example-cluster:2021-02-25 14:05:57,129 DEBUG config.py:83 -- Updating the resources of node type head-node to include {'CPU': 1, 'GPU': 0}.
example-cluster:2021-02-25 14:05:57,129 DEBUG config.py:83 -- Updating the resources of node type worker-node to include {'CPU': 1, 'GPU': 0}.
example-cluster:2021-02-25 14:05:57,153 WARNING config.py:164 -- KubernetesNodeProvider: not checking if namespace 'ray' exists
example-cluster:2021-02-25 14:05:57,153 INFO config.py:184 -- KubernetesNodeProvider: no autoscaler_service_account config provided, must already exist
example-cluster:2021-02-25 14:05:57,153 INFO config.py:210 -- KubernetesNodeProvider: no autoscaler_role config provided, must already exist
example-cluster:2021-02-25 14:05:57,153 INFO config.py:236 -- KubernetesNodeProvider: no autoscaler_role_binding config provided, must already exist
example-cluster:2021-02-25 14:05:57,154 INFO config.py:269 -- KubernetesNodeProvider: no services config provided, must already exist
example-cluster:2021-02-25 14:05:57,161 INFO node_provider.py:114 -- KubernetesNodeProvider: calling create_namespaced_pod (count=1).
The warning messages from KubernetesNodeProvider are meaningless and should be removed on our end.
Updating the resources of node type head-node to include {'CPU': 1, 'GPU': 0}.
GPU:0 is a bug that was fixed a couple of weeks ago, suggesting that the image being pulled for the operator pod is not fresh (rayproject/ray:nightly in operator.yaml).
That’s extremely strange given that the imagePullPolicy is set to Always in that file.
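One way to confirm which image the operator pod is actually running is to check the image digest it reports, for example:

# Digest of the image the operator container is running
kubectl -n ray get pod ray-operator-pod \
  -o jsonpath='{.status.containerStatuses[0].imageID}'

If that digest doesn't match the current rayproject/ray:nightly, the pod is running a stale copy.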
Can you provide more information about your Kubernetes setup?
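Also, when the head pod never gets created at all, the namespace events and the operator logs often show the reason (an image pull failure, an RBAC/permission error, etc.), e.g.:

# Recent events in the ray namespace, newest last
kubectl -n ray get events --sort-by=.lastTimestamp

# Any errors the operator reports while trying to create pods
kubectl -n ray logs ray-operator-pod | grep -i -e error -e forbidden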
It is a private cloud based on VMware's CNCF-compliant K8s distribution; the details are shown below. I'm unable to pull the image directly from Docker Hub, as I need to go through a local Harbor registry. I'll pull the latest Ray image on my laptop and push it to the Harbor registry (roughly the steps sketched after the version output below), then try again with the latest image. Thanks, Dimitri.
Update: the K8s version is v1.18.
kubectl version
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:50:19Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.15+vmware.1", GitCommit:"9a9f80f2e0b85ce6280dd9b9f1e952a7dbf49087", GitTreeState:"clean", BuildDate:"2021-01-19T22:59:52Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
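The mirroring plan is roughly the following (harbor.example.local/ray is a placeholder for the actual Harbor project path):

# On the laptop, which can reach Docker Hub
docker pull rayproject/ray:nightly

# Retag and push into the internal Harbor registry
docker tag rayproject/ray:nightly harbor.example.local/ray/ray:nightly
docker push harbor.example.local/ray/ray:nightly

# Then point the image field in operator.yaml (and the cluster config, if it sets one) at the Harbor copy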
I updated my registry with the latest Ray image and, as you mentioned, the warning messages are gone.
However, I still have the same problem: only the operator pod shows up when I list the pods.
kubectl -n ray get pods
NAME READY STATUS RESTARTS AGE
ray-operator-pod 1/1 Running 0 5h
If you'd like to take a look, I'm adding a Dropbox link to the folder with the 3 .yaml files I used to deploy Ray. They were taken from the current master branch of the Ray Git repo. The only modification is that I increased the memory from 512Mi to 1024Mi. Thanks.
Update: I used the same .yaml files to deploy Ray on MicroK8s and they work fine. Something might not be working well in my private K8s cluster. I'm investigating.
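One way to check whether this is a permissions problem is to ask whether the operator's service account can create pods in the ray namespace (ray-operator-serviceaccount below is a placeholder; use whatever serviceAccountName operator.yaml defines):

kubectl -n ray auth can-i create pods \
  --as=system:serviceaccount:ray:ray-operator-serviceaccount
kubectl -n ray auth can-i create services \
  --as=system:serviceaccount:ray:ray-operator-serviceaccount

If either of these answers "no", the operator cannot create the head and worker pods.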
It was a permission issue. In Tanzu (VMware's K8s distribution), certain permission restrictions are enforced in the K8s guest clusters that regular users can create. Because of this restriction, only the operator pod was created, not the head and worker pods. I saved the role spec I shared in this post in a .yaml file and applied it to the ray namespace. After that, I applied the operator.yaml and example_cluster.yaml files from the Ray Git repo, and I got the head pod (1) and worker pods (2) up and running. Thanks a lot, Dimitri!
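For anyone hitting the same restriction, a minimal Role/RoleBinding along the following lines should cover the pod and service permissions the operator needs in the ray namespace (an illustrative sketch, not the exact spec I applied; the names and the service account are placeholders):

kubectl -n ray apply -f - <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ray-operator-pod-access        # placeholder name
rules:
- apiGroups: [""]
  resources: ["pods", "pods/status", "pods/exec", "services"]
  verbs: ["get", "list", "watch", "create", "delete", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ray-operator-pod-access        # placeholder name
subjects:
- kind: ServiceAccount
  name: ray-operator-serviceaccount    # placeholder; match the service account in operator.yaml
  namespace: ray
roleRef:
  kind: Role
  name: ray-operator-pod-access
  apiGroup: rbac.authorization.k8s.io
EOF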