Autoscaler issues with the K8s Operator

Hi,

I deployed the operator following the procedure from the Ray documentation page "Deploying on Kubernetes" (Ray v2.0.0.dev0).

Here is what I did and the issue I ran into at the end, where it looks like the autoscaler is not being configured properly by the operator. Any guidance on how to solve this would be appreciated.
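
For context, the deployment boiled down to applying the CRD, the operator, and the example cluster config from the Ray repo, roughly like this (the repo paths below are from memory and may differ in your checkout):

kubectl create namespace ray
# Register the RayCluster custom resource definition that the operator watches
kubectl apply -f python/ray/autoscaler/kubernetes/operator_configs/cluster_crd.yaml
# Start the operator pod in the ray namespace
kubectl -n ray apply -f python/ray/autoscaler/kubernetes/operator_configs/operator.yaml
# Create the example RayCluster that the operator should turn into head and worker pods
kubectl -n ray apply -f python/ray/autoscaler/kubernetes/operator_configs/example_cluster.yaml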

List pods

kubectl -n ray get pods
NAME               READY   STATUS    RESTARTS   AGE
ray-operator-pod   1/1     Running   0          3h58m

There are no head or worker pods listed.
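
A quick sanity check that the RayCluster custom resource itself exists (assuming the CRD applied cleanly; example-cluster is the cluster name that appears in the operator logs) is to query it directly:

# List the RayCluster objects the operator is watching and inspect the example cluster
kubectl -n ray get rayclusters
kubectl -n ray describe raycluster example-cluster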

List the logs

kubectl -n ray logs ray-operator-pod | grep ^example-cluster:
example-cluster:2021-02-25 14:05:57,129 DEBUG config.py:83 -- Updating the resources of node type head-node to include {'CPU': 1, 'GPU': 0}.
example-cluster:2021-02-25 14:05:57,129 DEBUG config.py:83 -- Updating the resources of node type worker-node to include {'CPU': 1, 'GPU': 0}.
example-cluster:2021-02-25 14:05:57,153 WARNING config.py:164 -- KubernetesNodeProvider: not checking if namespace 'ray' exists
example-cluster:2021-02-25 14:05:57,153 INFO config.py:184 -- KubernetesNodeProvider: no autoscaler_service_account config provided, must already exist
example-cluster:2021-02-25 14:05:57,153 INFO config.py:210 -- KubernetesNodeProvider: no autoscaler_role config provided, must already exist
example-cluster:2021-02-25 14:05:57,153 INFO config.py:236 -- KubernetesNodeProvider: no autoscaler_role_binding config provided, must already exist
example-cluster:2021-02-25 14:05:57,154 INFO node_provider.py:114 -- KubernetesNodeProvider: calling create_namespaced_pod (count=1).
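
The log stops right after the create_namespaced_pod call, so if that call is being rejected by the API server, the error should show up in the unfiltered operator log rather than under the example-cluster prefix. Standard kubectl checks for that, nothing Ray-specific:

# Full operator log, in case errors are logged without the cluster-name prefix
kubectl -n ray logs ray-operator-pod --tail=200
# Recent events in the namespace; scheduling or admission problems often surface here
kubectl -n ray get events --sort-by=.lastTimestamp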

@Dmitri take a look?

The warning messages from KubernetesNodeProvider are meaningless and need to be removed.

Updating the resources of node type head-node to include {'CPU': 1, 'GPU': 0}.

The GPU: 0 entry is a bug that was fixed a couple of weeks ago, which suggests that the image being pulled for the operator pod is not fresh (rayproject/ray:nightly in operator.yaml).

That’s extremely strange given that the imagePullPolicy is set to Always in that file.
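
One way to double-check which image the operator pod is actually running (and whether imagePullPolicy: Always really re-pulled it) is to look at the image tag and digest recorded on the pod:

# Print the image reference and the resolved digest for the operator container
kubectl -n ray get pod ray-operator-pod -o jsonpath='{.status.containerStatuses[0].image}{"\n"}{.status.containerStatuses[0].imageID}{"\n"}'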

Can you provide more information about your Kubernetes setup?

It is a private cloud based on VMware's CNCF-compliant K8s distribution; the details are shown below. I'm unable to pull images directly from Docker Hub, so I have to go through a local Harbor registry. I'll pull the latest Ray image onto my laptop, push it to the Harbor registry, and try again with the latest image. Thanks, Dmitri.

kubectl version -o json
{
  "clientVersion": {
    "major": "1",
    "minor": "19",
    "gitVersion": "v1.19.3",
    "gitCommit": "1e11e4a2108024935ecfcb2912226cedeafd99df",
    "gitTreeState": "clean",
    "buildDate": "2020-10-14T12:50:19Z",
    "goVersion": "go1.15.2",
    "compiler": "gc",
    "platform": "darwin/amd64"
  },
  "serverVersion": {
    "major": "1",
    "minor": "16",
    "gitVersion": "v1.16.14+vmware.1",
    "gitCommit": "bbf52cd6cf83a50507e943b717ed321d383a37b5",
    "gitTreeState": "clean",
    "buildDate": "2020-08-19T23:48:38Z",
    "goVersion": "go1.13.15",
    "compiler": "gc",
    "platform": "linux/amd64"
  }
}
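
The image refresh itself is just a pull, retag, and push through my laptop (harbor.example.com below is a placeholder for the registry address):

# On a machine with Docker Hub access, pull the fresh nightly
docker pull rayproject/ray:nightly
# Retag it for the local Harbor project and push it
docker tag rayproject/ray:nightly harbor.example.com/ray/ray:nightly
docker push harbor.example.com/ray/ray:nightly
# Then point the image fields in operator.yaml and the cluster config at the Harbor copy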

I made the following changes:

  1. Updated the Kubernetes server version to v1.18.
    kubectl version
    Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:50:19Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"darwin/amd64"}
    Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.15+vmware.1", GitCommit:"9a9f80f2e0b85ce6280dd9b9f1e952a7dbf49087", GitTreeState:"clean", BuildDate:"2021-01-19T22:59:52Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

  2. I updated my registry with the latest Ray image. As you mentioned, the warning messages are gone.

  3. However, I still have the same problem: only the operator pod shows up when listing the pods.
    kubectl -n ray get pods
    NAME               READY   STATUS    RESTARTS   AGE
    ray-operator-pod   1/1     Running   0          5h

If you'd like to take a look, I'm adding a Dropbox link to the folder with the three .yaml files I used to deploy Ray. They were extracted from Ray's current master branch; the only modification is that I increased the memory from 512Mi to 1024Mi. Thanks.

Update: I used the same .yaml files to deploy Ray on MicroK8s and they work fine, so something is likely not working well in my private K8s cluster. I'm investigating.

Update: the fix was to create the following PodSecurityPolicy and RBAC resources:


apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: kubeapps-psp
spec:
  privileged: true
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  runAsUser:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes:
    - '*'
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: kubeapps-clusterrole
rules:
  - apiGroups:
      - policy
    resources:
      - podsecuritypolicies
    verbs:
      - use
    resourceNames:
      - kubeapps-psp
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: kubeapps-clusterrole
  namespace: ray
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kubeapps-clusterrole
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: Group
    name: system:serviceaccounts
  - kind: ServiceAccount
    name: default
    namespace: ray
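
I saved the three documents above (separated by ---) in one file and applied it before re-deploying the operator and the example cluster; the file name below is just what I called it locally:

kubectl -n ray apply -f ray-psp-rbac.yaml
kubectl -n ray apply -f operator.yaml
kubectl -n ray apply -f example_cluster.yaml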

Cool – thanks for sharing the fix!!

Would you mind sharing what was going wrong and how this fixed the problem?

It was a permission issue. In Tanzu (VMware's K8s distribution), certain permission restrictions are enforced in the K8s guest clusters that regular users can create. Because of this restriction, only the operator pod was created, not the head and worker pods. I saved the spec I shared in this post in a .yaml file and applied it to the ray namespace. After that, I applied the operator.yaml and example_cluster.yaml files from the Ray Git repo, and the head pod (1) and worker pods (2) came up and ran fine. Thanks a lot, Dmitri!
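
For anyone hitting the same thing on Tanzu, two quick checks after applying the spec (resource names match the ones above; output will of course vary):

# Confirm the service accounts in the ray namespace are now allowed to use the PSP
kubectl auth can-i use podsecuritypolicies/kubeapps-psp --as=system:serviceaccount:ray:default -n ray
# After re-applying operator.yaml and example_cluster.yaml, this should show the operator pod
# plus one head pod and two worker pods, all Running
kubectl -n ray get pods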