Unable to create cluster in kubernetes namespace


I have a Kubernetes cluster with the following client and server versions:

kubectl version

Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.1", GitCommit:"d647ddbd755faf07169599a625faf302ffc34458", GitTreeState:"clean", BuildDate:"2019-10-02T23:49:20Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"darwin/amd64"}

Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.8+IKS", GitCommit:"2051f2b131d2ba8ca584e7734e8c5284dac3630d", GitTreeState:"clean", BuildDate:"2021-02-24T04:17:23Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}

I am trying to deploy a Ray cluster on Kubernetes by following the instructions at the link below:


I am trying to launch the operator but it is failing:

kubectl get pods
ray-operator-pod 0/1 CrashLoopBackOff 5 5m

Here is the log:

$ kubectl logs ray-operator-pod
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/ray_operator/operator.py", line 150, in main
    for event in cluster_cr_stream:
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 157, in stream
    resp = func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api/custom_objects_api.py", line 2113, in list_namespaced_custom_object
    return self.list_namespaced_custom_object_with_http_info(group, version, namespace, plural, **kwargs) # noqa: E501
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api/custom_objects_api.py", line 2258, in list_namespaced_custom_object_with_http_info
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 353, in call_api
    _preload_content, _request_timeout, _host)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 184, in __call_api
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 377, in request
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/rest.py", line 243, in GET
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/rest.py", line 233, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'text/plain; charset=utf-8', 'X-Content-Type-Options': 'nosniff', 'Date': 'Fri, 26 Mar 2021 22:37:23 GMT', 'Content-Length': '19'})
HTTP response body: b'404 page not found\n'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ray/anaconda3/bin/ray-operator", line 8, in
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/ray_operator/operator.py", line 158, in main
    "Caught a 404 error. Has the RayCluster CRD been created?")
Exception: Caught a 404 error. Has the RayCluster CRD been created?

Can you please help?

Thanks for your question!

Before running the Operator, apply the custom resource definition (CRD) for the resources the operator acts on:
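To confirm whether the CRD is actually registered before starting the operator, a check along the following lines may help. This is only a minimal sketch: it assumes the `kubernetes` Python client is installed and a kubeconfig is available, and the names `crd_exists` and `main` are made up for illustration. The CRD name is the one that appears in the kubectl output quoted in this thread.

```python
# Name of the CRD as it appears in the kubectl output quoted in this thread.
CRD_NAME = "rayclusters.cluster.ray.io"


def crd_exists(api, name=CRD_NAME):
    """Return True if the CRD is registered, False if the API server answers 404.

    `api` is expected to behave like kubernetes.client.ApiextensionsV1Api.
    (Hypothetical helper, not part of Ray or the kubernetes client.)
    """
    try:
        api.read_custom_resource_definition(name)
        return True
    except Exception as exc:
        # kubernetes ApiException carries the HTTP status code in `.status`.
        if getattr(exc, "status", None) == 404:
            return False
        raise


def main():
    # Assumption: the `kubernetes` package is installed and kubeconfig is set up.
    from kubernetes import client, config

    config.load_kube_config()
    if crd_exists(client.ApiextensionsV1Api()):
        print(f"{CRD_NAME} is registered")
    else:
        print(f"{CRD_NAME} is missing; apply the CRD before starting the operator")
```

Calling `main()` against the cluster should report a missing CRD in the situation that produces the 404 in the operator log above.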

Thanks. The command to apply the Ray CRD hung for me:
kubectl apply -f ray/python/ray/autoscaler/kubernetes/operator_configs/cluster_crd.yaml

I waited for about 5 minutes, expecting the output below in the terminal, but got nothing:

customresourcedefinition.apiextensions.k8s.io/rayclusters.cluster.ray.io created

I went ahead and created the cluster, and I can see that the pods are running:

kubectl -n ray get pods
example-cluster-ray-head-2cn9m 1/1 Running 0 3m24s
example-cluster-ray-worker-f247z 1/1 Running 0 2m39s
example-cluster-ray-worker-fwlk4 1/1 Running 0 2m39s
ray-operator-pod 1/1 Running 559 2d

kubectl -n ray get rayclusters
example-cluster Running 5m41s

To check the logs I ran the following command, but I got no output in my terminal:

kubectl -n ray logs ray-operator-pod | grep ^example-cluster2: | tail -n 100

If I recall correctly, the documentation suggests creating two Ray clusters, one named example-cluster, launched with the command

$ kubectl -n ray apply -f ray/python/ray/autoscaler/kubernetes/operator_configs/example_cluster.yaml

and another named example-cluster2, launched with the command

$ kubectl -n ray apply -f ray/python/ray/autoscaler/kubernetes/operator_configs/example_cluster2.yaml

From the output you’ve pasted above, it looks like the first command has executed correctly and the first cluster has been launched.

You can extract the last 100 lines of its logs with
kubectl -n ray logs ray-operator-pod | grep ^example-cluster: | tail -n 100

It looks like the second cluster, example-cluster2, was not launched. Maybe you didn’t run the command to start it?
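To make the filtering explicit: the grep pattern only keeps lines that begin with the cluster name followed by a colon, so a filter on example-cluster2: returns nothing when no such cluster is producing log lines. Here is a rough Python equivalent of the grep-and-tail pipeline. The function name `cluster_log_lines` is hypothetical, and the sketch assumes, as the grep pattern above does, that operator log lines are prefixed with the cluster name and a colon.

```python
def cluster_log_lines(log_text, cluster_name, last_n=100):
    """Mimic `grep ^<cluster_name>: | tail -n <last_n>` on operator log text.

    Hypothetical helper for illustration; `log_text` is the full output of
    `kubectl -n ray logs ray-operator-pod` captured as a string.
    """
    prefix = cluster_name + ":"
    # Keep only lines that start with "<cluster_name>:". The trailing colon
    # keeps "example-cluster" from also matching "example-cluster2".
    matching = [line for line in log_text.splitlines() if line.startswith(prefix)]
    # Return at most the last `last_n` matching lines.
    return matching[-last_n:] if last_n > 0 else []
```

An empty result for a given name means the operator has logged nothing under that prefix, which is what you would see for a cluster that was never created.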

Yes, I did not run the example-cluster2.

What I am trying to convey is that

  1. kubectl apply -f ray/python/ray/autoscaler/kubernetes/operator_configs/cluster_crd.yaml hung for me and “did not” produce the output below:

customresourcedefinition.apiextensions.k8s.io/rayclusters.cluster.ray.io created

I can get the logs from the cluster, now that the cluster is up and running

Thank you for the prompt response!

It’s weird that the command didn’t finish executing. Evidently the CRD was created, but for some reason the command didn’t return.

If you delete the CRD
kubectl delete -f ray/python/ray/autoscaler/kubernetes/operator_configs/cluster_crd.yaml
and re-create it
kubectl apply -f ray/python/ray/autoscaler/kubernetes/operator_configs/cluster_crd.yaml
does the problem persist? Does the command hang again?

If so, you could try running the following command in another shell:
kubectl describe crd raycluster
The Status field at the end of the output could have some useful info.
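If you would rather check the Status programmatically, the conditions can be read from the JSON form of the CRD (for example the output of kubectl get crd rayclusters.cluster.ray.io -o json). The sketch below is not an official check: the helper names are made up, and treating NamesAccepted and Established as the two conditions that must both be "True" is my assumption about what a healthy CRD looks like.

```python
import json

# Assumption: a usable CRD reports both of these condition types as "True".
REQUIRED_CONDITIONS = ("NamesAccepted", "Established")


def crd_conditions(crd_json):
    """Extract .status.conditions from the JSON text of a CRD object."""
    return json.loads(crd_json).get("status", {}).get("conditions", [])


def crd_is_ready(conditions):
    """Return True when every required condition type has status "True"."""
    status_by_type = {c.get("type"): c.get("status") for c in conditions}
    return all(status_by_type.get(t) == "True" for t in REQUIRED_CONDITIONS)
```

A CRD that is missing one of the two conditions, or reports "False" for either, would come back as not ready, which could point to why the apply command did not return.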

OK, I’m not sure what caused the hang earlier, but everything works now. Thank you for all your help!

Accepted Names:
Kind: RayCluster
List Kind: RayClusterList
Plural: rayclusters
Singular: raycluster
Last Transition Time: 2021-03-29T13:38:13Z
Message: no conflicts found
Reason: NoConflicts
Status: True
Type: NamesAccepted
Last Transition Time: 2021-03-29T13:38:13Z
Message: the initial names have been accepted
Reason: InitialNamesAccepted
Status: True
Type: Established
Stored Versions:
