Autoscaler issues with the K8s Operator

Hi,

I deployed the operator following the procedure from the Ray documentation page "Deploying on Kubernetes" (Ray v2.0.0.dev0).

Here is what I did and the issue I ran into at the end, where it looks like the autoscaler is not being configured properly by the operator. Any guidance on how to solve this would be appreciated.
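
For context, the deployment boiled down to applying the CRD, the operator, and the example cluster config from the Ray repo, roughly like this (the repo paths below are from memory and may differ in your checkout):

kubectl create namespace ray
# Register the RayCluster custom resource definition that the operator watches
kubectl apply -f python/ray/autoscaler/kubernetes/operator_configs/cluster_crd.yaml
# Start the operator pod in the ray namespace
kubectl -n ray apply -f python/ray/autoscaler/kubernetes/operator_configs/operator.yaml
# Create the example RayCluster that the operator should turn into head and worker pods
kubectl -n ray apply -f python/ray/autoscaler/kubernetes/operator_configs/example_cluster.yaml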

List pods

kubectl -n ray get pods
NAME               READY   STATUS    RESTARTS   AGE
ray-operator-pod   1/1     Running   0          3h58m

There are no head or worker pods listed.
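
A quick sanity check that the RayCluster custom resource itself exists (assuming the CRD applied cleanly; example-cluster is the cluster name that appears in the operator logs) is to query it directly:

# List the RayCluster objects the operator is watching and inspect the example cluster
kubectl -n ray get rayclusters
kubectl -n ray describe raycluster example-cluster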

List the logs

kubectl -n ray logs ray-operator-pod | grep ^example-cluster:
example-cluster:2021-02-25 14:05:57,129 DEBUG config.py:83 -- Updating the resources of node type head-node to include {'CPU': 1, 'GPU': 0}.
example-cluster:2021-02-25 14:05:57,129 DEBUG config.py:83 -- Updating the resources of node type worker-node to include {'CPU': 1, 'GPU': 0}.
example-cluster:2021-02-25 14:05:57,153 WARNING config.py:164 -- KubernetesNodeProvider: not checking if namespace 'ray' exists
example-cluster:2021-02-25 14:05:57,153 INFO config.py:184 -- KubernetesNodeProvider: no autoscaler_service_account config provided, must already exist
example-cluster:2021-02-25 14:05:57,153 INFO config.py:210 -- KubernetesNodeProvider: no autoscaler_role config provided, must already exist
example-cluster:2021-02-25 14:05:57,153 INFO config.py:236 -- KubernetesNodeProvider: no autoscaler_role_binding config provided, must already exist
example-cluster:2021-02-25 14:05:57,154 INFO node_provider.py:114 -- KubernetesNodeProvider: calling create_namespaced_pod (count=1).
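
The log stops right after the create_namespaced_pod call, so if that call is being rejected by the API server, the error should show up in the unfiltered operator log rather than under the example-cluster prefix. Standard kubectl checks for that, nothing Ray-specific:

# Full operator log, in case errors are logged without the cluster-name prefix
kubectl -n ray logs ray-operator-pod --tail=200
# Recent events in the namespace; scheduling or admission problems often surface here
kubectl -n ray get events --sort-by=.lastTimestamp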

@Dmitri take a look?

The warning messages from KubernetesNodeProvider are meaningless and need to be removed.

Updating the resources of node type head-node to include {'CPU': 1, 'GPU': 0}.

The GPU: 0 entry is a bug that was fixed a couple of weeks ago, which suggests that the image being pulled for the operator pod is not fresh (rayproject/ray:nightly in operator.yaml).

That’s extremely strange given that the imagePullPolicy is set to Always in that file.
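
One way to double-check which image the operator pod is actually running (and whether imagePullPolicy: Always really re-pulled it) is to look at the image tag and digest recorded on the pod:

# Print the image reference and the resolved digest for the operator container
kubectl -n ray get pod ray-operator-pod -o jsonpath='{.status.containerStatuses[0].image}{"\n"}{.status.containerStatuses[0].imageID}{"\n"}'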

Can you provide more information about your Kubernetes setup?

It is a private cloud based on VMware's CNCF-compliant K8s distribution; the details are shown below. I'm unable to pull images directly from Docker Hub, so I have to go through a local Harbor registry. I'll pull the latest Ray image onto my laptop, push it to the Harbor registry, and try again with the latest image. Thanks, Dmitri.

kubectl version -o json
{
  "clientVersion": {
    "major": "1",
    "minor": "19",
    "gitVersion": "v1.19.3",
    "gitCommit": "1e11e4a2108024935ecfcb2912226cedeafd99df",
    "gitTreeState": "clean",
    "buildDate": "2020-10-14T12:50:19Z",
    "goVersion": "go1.15.2",
    "compiler": "gc",
    "platform": "darwin/amd64"
  },
  "serverVersion": {
    "major": "1",
    "minor": "16",
    "gitVersion": "v1.16.14+vmware.1",
    "gitCommit": "bbf52cd6cf83a50507e943b717ed321d383a37b5",
    "gitTreeState": "clean",
    "buildDate": "2020-08-19T23:48:38Z",
    "goVersion": "go1.13.15",
    "compiler": "gc",
    "platform": "linux/amd64"
  }
}
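
The image refresh itself is just a pull, retag, and push through my laptop (harbor.example.com below is a placeholder for the registry address):

# On a machine with Docker Hub access, pull the fresh nightly
docker pull rayproject/ray:nightly
# Retag it for the local Harbor project and push it
docker tag rayproject/ray:nightly harbor.example.com/ray/ray:nightly
docker push harbor.example.com/ray/ray:nightly
# Then point the image fields in operator.yaml and the cluster config at the Harbor copy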

I made the following changes:

  1. Updated the Kubernetes server version to v1.18.
    kubectl version
    Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:50:19Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"darwin/amd64"}
    Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.15+vmware.1", GitCommit:"9a9f80f2e0b85ce6280dd9b9f1e952a7dbf49087", GitTreeState:"clean", BuildDate:"2021-01-19T22:59:52Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

  2. I updated my registry with the latest Ray image. As you mentioned, the warning messages are gone.

  3. However, I still have the same problem: only the operator pod shows up when listing the pods.
    kubectl -n ray get pods
    NAME               READY   STATUS    RESTARTS   AGE
    ray-operator-pod   1/1     Running   0          5h

If you'd like to take a look, I'm adding a Dropbox link to the folder with the three .yaml files I used to deploy Ray. They were extracted from Ray's current master branch; the only modification is that I increased the memory from 512Mi to 1024Mi. Thanks.

Update: I used the same .yaml files to deploy Ray on MicroK8s and they work fine, so something is likely not working well in my private K8s cluster. I'm investigating.

Update: the fix was to create the following PodSecurityPolicy and RBAC resources:


apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: kubeapps-psp
spec:
  privileged: true
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  runAsUser:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes:
    - '*'
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: kubeapps-clusterrole
rules:
  - apiGroups:
      - policy
    resources:
      - podsecuritypolicies
    verbs:
      - use
    resourceNames:
      - kubeapps-psp
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: kubeapps-clusterrole
  namespace: ray
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kubeapps-clusterrole
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: Group
    name: system:serviceaccounts
  - kind: ServiceAccount
    name: default
    namespace: ray
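
I saved the three documents above (separated by ---) in one file and applied it before re-deploying the operator and the example cluster; the file name below is just what I called it locally:

kubectl -n ray apply -f ray-psp-rbac.yaml
kubectl -n ray apply -f operator.yaml
kubectl -n ray apply -f example_cluster.yaml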

Cool – thanks for sharing the fix!!

Would you mind sharing what was going wrong and how this fixed the problem?

It was a permission issue. In Tanzu (VMware's K8s distribution), certain permission restrictions are enforced in the K8s guest clusters that regular users can create. Because of this restriction, only the operator pod was created, not the head and worker pods. I saved the spec I shared in this post in a .yaml file and applied it to the ray namespace. After that, I applied the operator.yaml and example_cluster.yaml files from the Ray Git repo, and the head pod (1) and worker pods (2) came up and ran fine. Thanks a lot, Dmitri!
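
For anyone hitting the same thing on Tanzu, two quick checks after applying the spec (resource names match the ones above; output will of course vary):

# Confirm the service accounts in the ray namespace are now allowed to use the PSP
kubectl auth can-i use podsecuritypolicies/kubeapps-psp --as=system:serviceaccount:ray:default -n ray
# After re-applying operator.yaml and example_cluster.yaml, this should show the operator pod
# plus one head pod and two worker pods, all Running
kubectl -n ray get pods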