KubeRay operator keeps restarting

How severely does this issue affect your experience of using Ray?

  • Medium: It causes significant difficulty in completing my task, but I can work around it.

Hi Ray Team,

I started migrating my Ray cluster from the legacy K8s operator to KubeRay. I deployed the KubeRay operator via its Helm chart, and I'm seeing a strange situation in which the kuberay-operator pod keeps restarting within a short period.

NAME                                           READY   STATUS    RESTARTS   AGE
kuberay-operator-59d4ddc7f4-9kl6s              1/1     Running   7          28m
ray-cluster-kuberay-head-xrqh2                 1/1     Running   0          28m
ray-cluster-kuberay-worker-workergroup-4wf2s   1/1     Running   0          28m

The output above shows that the kuberay-operator restarted 7 times in 28 minutes…

When the kuberay-operator went down, its status showed CrashLoopBackOff.

NAME                                           READY   STATUS             RESTARTS   AGE
kuberay-operator-59d4ddc7f4-9kl6s              0/1     CrashLoopBackOff   9          46m
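One way to capture the logs of the crashed container instance (pod name and namespace as in the outputs above) is to request the previous container's logs:

$ kubectl -n ray-system logs kuberay-operator-59d4ddc7f4-9kl6s --previous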

I captured the following logs when it went down:

2022-10-03T08:29:11.092Z	INFO	setup	the operator	{"version:": ""}
2022-10-03T08:29:11.092Z	INFO	setup	Feature flag prioritize-workers-to-delete is enabled.
I1003 08:29:12.193426       1 request.go:665] Waited for 1.003580509s due to client-side throttling, not priority and fairness, request: GET:https://10.164.0.1:443/apis/cert-manager.io/v1beta1?timeout=32s
2022-10-03T08:29:14.990Z	INFO	controller-runtime.metrics	Metrics server is starting to listen	{"addr": ":8080"}
2022-10-03T08:29:14.990Z	INFO	setup	starting manager
2022-10-03T08:29:14.991Z	INFO	Starting server	{"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
2022-10-03T08:29:14.991Z	INFO	Starting server	{"kind": "health probe", "addr": "[::]:8082"}
I1003 08:29:14.991246       1 leaderelection.go:248] attempting to acquire leader lease ray-system/ray-operator-leader...
I1003 08:29:33.085800       1 leaderelection.go:258] successfully acquired lease ray-system/ray-operator-leader
2022-10-03T08:29:33.085Z	DEBUG	events	Normal	{"object": {"kind":"ConfigMap","namespace":"ray-system","name":"ray-operator-leader","uid":"a4b18352-f4b4-41a1-a6f4-cf45b838663c","apiVersion":"v1","resourceVersion":"309279038"}, "reason": "LeaderElection", "message": "kuberay-operator-59d4ddc7f4-9kl6s_b7d9dbfe-a103-4600-94db-fc64e5931119 became leader"}
2022-10-03T08:29:33.086Z	DEBUG	events	Normal	{"object": {"kind":"Lease","namespace":"ray-system","name":"ray-operator-leader","uid":"53e54feb-6db9-40da-812d-e2c9b776a722","apiVersion":"coordination.k8s.io/v1","resourceVersion":"309279040"}, "reason": "LeaderElection", "message": "kuberay-operator-59d4ddc7f4-9kl6s_b7d9dbfe-a103-4600-94db-fc64e5931119 became leader"}
2022-10-03T08:29:33.086Z	INFO	controller.raycluster-controller	Starting EventSource	{"reconciler group": "ray.io", "reconciler kind": "RayCluster", "source": "kind source: *v1alpha1.RayCluster"}
2022-10-03T08:29:33.086Z	INFO	controller.raycluster-controller	Starting EventSource	{"reconciler group": "ray.io", "reconciler kind": "RayCluster", "source": "kind source: *v1.Event"}
2022-10-03T08:29:33.086Z	INFO	controller.raycluster-controller	Starting EventSource	{"reconciler group": "ray.io", "reconciler kind": "RayCluster", "source": "kind source: *v1.Pod"}
2022-10-03T08:29:33.086Z	INFO	controller.raycluster-controller	Starting EventSource	{"reconciler group": "ray.io", "reconciler kind": "RayCluster", "source": "kind source: *v1.Service"}
2022-10-03T08:29:33.086Z	INFO	controller.raycluster-controller	Starting Controller	{"reconciler group": "ray.io", "reconciler kind": "RayCluster"}
2022-10-03T08:29:33.086Z	INFO	controller.rayjob	Starting EventSource	{"reconciler group": "ray.io", "reconciler kind": "RayJob", "source": "kind source: *v1alpha1.RayJob"}
2022-10-03T08:29:33.086Z	INFO	controller.rayjob	Starting EventSource	{"reconciler group": "ray.io", "reconciler kind": "RayJob", "source": "kind source: *v1alpha1.RayCluster"}
2022-10-03T08:29:33.086Z	INFO	controller.rayjob	Starting EventSource	{"reconciler group": "ray.io", "reconciler kind": "RayJob", "source": "kind source: *v1.Service"}
2022-10-03T08:29:33.086Z	INFO	controller.rayjob	Starting Controller	{"reconciler group": "ray.io", "reconciler kind": "RayJob"}
2022-10-03T08:29:33.087Z	INFO	controller.rayservice	Starting EventSource	{"reconciler group": "ray.io", "reconciler kind": "RayService", "source": "kind source: *v1alpha1.RayService"}
2022-10-03T08:29:33.087Z	INFO	controller.rayservice	Starting EventSource	{"reconciler group": "ray.io", "reconciler kind": "RayService", "source": "kind source: *v1alpha1.RayCluster"}
2022-10-03T08:29:33.087Z	INFO	controller.rayservice	Starting EventSource	{"reconciler group": "ray.io", "reconciler kind": "RayService", "source": "kind source: *v1.Service"}
2022-10-03T08:29:33.087Z	INFO	controller.rayservice	Starting EventSource	{"reconciler group": "ray.io", "reconciler kind": "RayService", "source": "kind source: *v1.Ingress"}
2022-10-03T08:29:33.087Z	INFO	controller.rayservice	Starting Controller	{"reconciler group": "ray.io", "reconciler kind": "RayService"}
I1003 08:29:33.689070       1 trace.go:205] Trace[610244577]: "DeltaFIFO Pop Process" ID:mark-lee/nogpu,Depth:36,Reason:slow event handlers blocking the queue (03-Oct-2022 08:29:33.491) (total time: 196ms):
Trace[610244577]: [196.950657ms] [196.950657ms] END
I1003 08:29:34.344988       1 request.go:665] Waited for 1.048843374s due to client-side throttling, not priority and fairness, request: GET:https://10.164.0.1:443/apis/coordination.k8s.io/v1beta1?timeout=32s
2022-10-03T08:29:36.698Z	ERROR	controller-runtime.source	if kind is a CRD, it should be installed before calling Start	{"kind": "RayJob.ray.io", "error": "no matches for kind \"RayJob\" in version \"ray.io/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.poll
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:580
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:131
2022-10-03T08:29:36.699Z	INFO	controller.raycluster-controller	Starting workers	{"reconciler group": "ray.io", "reconciler kind": "RayCluster", "worker count": 1}
2022-10-03T08:29:36.699Z	INFO	controllers.RayCluster	reconciling RayCluster	{"cluster name": "ray-cluster-kuberay"}
2022-10-03T08:29:36.699Z	INFO	controllers.RayCluster	reconcileServices 	{"headService service found": "ray-cluster-kuberay-head-svc"}
2022-10-03T08:29:36.699Z	INFO	controllers.RayCluster	reconcilePods 	{"head pod found": "ray-cluster-kuberay-head-xrqh2"}
2022-10-03T08:29:36.699Z	INFO	controllers.RayCluster	reconcilePods	{"head pod is up and running... checking workers": "ray-cluster-kuberay-head-xrqh2"}
2022-10-03T08:29:36.699Z	INFO	controllers.RayCluster	reconcilePods	{"removing the pods in the scaleStrategy of": "workergroup"}
2022-10-03T08:29:36.699Z	INFO	controllers.RayCluster	reconcilePods	{"all workers already exist for group": "workergroup"}
2022-10-03T08:29:36.712Z	INFO	controllers.RayCluster	reconcile RayCluster Event	{"event name": "kuberay-operator-59d4ddc7f4-9kl6s.171a80749ff68a31"}
2022-10-03T08:29:36.713Z	INFO	controllers.RayCluster	FT not enabled skipping event reconcile for pod.	{"pod name": "ray-cluster-kuberay-head-xrqh2"}
2022-10-03T08:29:36.713Z	INFO	controllers.RayCluster	reconciling RayCluster	{"cluster name": "ray-cluster-kuberay"}
2022-10-03T08:29:36.713Z	INFO	controllers.RayCluster	reconcileServices 	{"headService service found": "ray-cluster-kuberay-head-svc"}
2022-10-03T08:29:36.713Z	INFO	controllers.RayCluster	reconcilePods 	{"head pod found": "ray-cluster-kuberay-head-xrqh2"}
2022-10-03T08:29:36.713Z	INFO	controllers.RayCluster	reconcilePods	{"head pod is up and running... checking workers": "ray-cluster-kuberay-head-xrqh2"}
2022-10-03T08:29:36.713Z	INFO	controllers.RayCluster	reconcilePods	{"removing the pods in the scaleStrategy of": "workergroup"}
2022-10-03T08:29:36.713Z	INFO	controllers.RayCluster	reconcilePods	{"all workers already exist for group": "workergroup"}
2022-10-03T08:29:36.796Z	ERROR	controllers.RayCluster	Update status error	{"cluster name": "ray-cluster-kuberay", "error": "Operation cannot be fulfilled on rayclusters.ray.io \"ray-cluster-kuberay\": the object has been modified; please apply your changes to the latest version and try again"}
github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayClusterReconciler).Reconcile
	/workspace/controllers/ray/raycluster_controller.go:95
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:227
2022-10-03T08:29:36.797Z	INFO	controllers.RayCluster	reconciling RayCluster	{"cluster name": "ray-cluster-kuberay"}
2022-10-03T08:29:36.797Z	INFO	controllers.RayCluster	reconcileServices 	{"headService service found": "ray-cluster-kuberay-head-svc"}
2022-10-03T08:29:36.797Z	INFO	controllers.RayCluster	reconcilePods 	{"head pod found": "ray-cluster-kuberay-head-xrqh2"}
2022-10-03T08:29:36.797Z	INFO	controllers.RayCluster	reconcilePods	{"head pod is up and running... checking workers": "ray-cluster-kuberay-head-xrqh2"}
2022-10-03T08:29:36.797Z	INFO	controllers.RayCluster	reconcilePods	{"removing the pods in the scaleStrategy of": "workergroup"}
2022-10-03T08:29:36.797Z	INFO	controllers.RayCluster	reconcilePods	{"all workers already exist for group": "workergroup"}
2022-10-03T08:29:36.800Z	INFO	controller.rayservice	Starting workers	{"reconciler group": "ray.io", "reconciler kind": "RayService", "worker count": 1}
I1003 08:29:47.750384       1 request.go:665] Waited for 1.042937768s due to client-side throttling, not priority and fairness, request: GET:https://10.164.0.1:443/apis/snapshot.storage.k8s.io/v1?timeout=32s
2022-10-03T08:29:50.204Z	ERROR	controller-runtime.source	if kind is a CRD, it should be installed before calling Start	{"kind": "RayJob.ray.io", "error": "no matches for kind \"RayJob\" in version \"ray.io/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:131
I1003 08:29:57.788582       1 request.go:665] Waited for 1.084089703s due to client-side throttling, not priority and fairness, request: GET:https://10.164.0.1:443/apis/install.istio.io/v1alpha1?timeout=32s
2022-10-03T08:30:00.203Z	ERROR	controller-runtime.source	if kind is a CRD, it should be installed before calling Start	{"kind": "RayJob.ray.io", "error": "no matches for kind \"RayJob\" in version \"ray.io/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:131
I1003 08:30:07.801017       1 request.go:665] Waited for 1.095882529s due to client-side throttling, not priority and fairness, request: GET:https://10.164.0.1:443/apis/cert-manager.io/v1?timeout=32s
2022-10-03T08:30:10.691Z	ERROR	controller-runtime.source	if kind is a CRD, it should be installed before calling Start	{"kind": "RayJob.ray.io", "error": "no matches for kind \"RayJob\" in version \"ray.io/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:131
I1003 08:30:17.849758       1 request.go:665] Waited for 1.145043946s due to client-side throttling, not priority and fairness, request: GET:https://10.164.0.1:443/apis/operators.coreos.com/v2?timeout=32s
2022-10-03T08:30:20.203Z	ERROR	controller-runtime.source	if kind is a CRD, it should be installed before calling Start	{"kind": "RayJob.ray.io", "error": "no matches for kind \"RayJob\" in version \"ray.io/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:131
I1003 08:30:27.850297       1 request.go:665] Waited for 1.145535029s due to client-side throttling, not priority and fairness, request: GET:https://10.164.0.1:443/apis/operators.coreos.com/v1alpha2?timeout=32s
2022-10-03T08:30:30.203Z	ERROR	controller-runtime.source	if kind is a CRD, it should be installed before calling Start	{"kind": "RayJob.ray.io", "error": "no matches for kind \"RayJob\" in version \"ray.io/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:131
I1003 08:30:37.850745       1 request.go:665] Waited for 1.145105639s due to client-side throttling, not priority and fairness, request: GET:https://10.164.0.1:443/apis/internal.autoscaling.k8s.io/v1alpha1?timeout=32s
2022-10-03T08:30:40.392Z	ERROR	controller-runtime.source	if kind is a CRD, it should be installed before calling Start	{"kind": "RayJob.ray.io", "error": "no matches for kind \"RayJob\" in version \"ray.io/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:131
I1003 08:30:47.901254       1 request.go:665] Waited for 1.195687309s due to client-side throttling, not priority and fairness, request: GET:https://10.164.0.1:443/apis/policy/v1beta1?timeout=32s
2022-10-03T08:30:50.204Z	ERROR	controller-runtime.source	if kind is a CRD, it should be installed before calling Start	{"kind": "RayJob.ray.io", "error": "no matches for kind \"RayJob\" in version \"ray.io/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:131
I1003 08:30:57.951035       1 request.go:665] Waited for 1.243269823s due to client-side throttling, not priority and fairness, request: GET:https://10.164.0.1:443/apis/management.cattle.io/v3?timeout=32s
2022-10-03T08:31:00.213Z	ERROR	controller-runtime.source	if kind is a CRD, it should be installed before calling Start	{"kind": "RayJob.ray.io", "error": "no matches for kind \"RayJob\" in version \"ray.io/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:131
I1003 08:31:08.000058       1 request.go:665] Waited for 1.29533055s due to client-side throttling, not priority and fairness, request: GET:https://10.164.0.1:443/apis/scheduling.k8s.io/v1?timeout=32s
2022-10-03T08:31:10.204Z	ERROR	controller-runtime.source	if kind is a CRD, it should be installed before calling Start	{"kind": "RayJob.ray.io", "error": "no matches for kind \"RayJob\" in version \"ray.io/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:131
I1003 08:31:18.000489       1 request.go:665] Waited for 1.296084477s due to client-side throttling, not priority and fairness, request: GET:https://10.164.0.1:443/apis/install.istio.io/v1alpha1?timeout=32s
2022-10-03T08:31:20.203Z	ERROR	controller-runtime.source	if kind is a CRD, it should be installed before calling Start	{"kind": "RayJob.ray.io", "error": "no matches for kind \"RayJob\" in version \"ray.io/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:131
I1003 08:31:28.050329       1 request.go:665] Waited for 1.345072049s due to client-side throttling, not priority and fairness, request: GET:https://10.164.0.1:443/apis/telemetry.istio.io/v1alpha1?timeout=32s
2022-10-03T08:31:30.493Z	ERROR	controller-runtime.source	if kind is a CRD, it should be installed before calling Start	{"kind": "RayJob.ray.io", "error": "no matches for kind \"RayJob\" in version \"ray.io/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:131
2022-10-03T08:31:33.087Z	ERROR	controller.rayjob	Could not wait for Cache to sync	{"reconciler group": "ray.io", "reconciler kind": "RayJob", "error": "failed to wait for rayjob caches to sync: timed out waiting for cache to be synced"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:208
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:234
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/manager/runnable_group.go:218
2022-10-03T08:31:33.087Z	INFO	Stopping and waiting for non leader election runnables
2022-10-03T08:31:33.087Z	INFO	Stopping and waiting for leader election runnables
2022-10-03T08:31:33.087Z	INFO	controller.rayservice	Shutdown signal received, waiting for all workers to finish	{"reconciler group": "ray.io", "reconciler kind": "RayService"}
2022-10-03T08:31:33.087Z	INFO	controller.raycluster-controller	Shutdown signal received, waiting for all workers to finish	{"reconciler group": "ray.io", "reconciler kind": "RayCluster"}
2022-10-03T08:31:33.087Z	INFO	controller.rayservice	All workers finished	{"reconciler group": "ray.io", "reconciler kind": "RayService"}
2022-10-03T08:31:33.087Z	INFO	controller.raycluster-controller	All workers finished	{"reconciler group": "ray.io", "reconciler kind": "RayCluster"}
2022-10-03T08:31:33.087Z	INFO	Stopping and waiting for caches
2022-10-03T08:31:33.087Z	INFO	Stopping and waiting for webhooks
2022-10-03T08:31:33.087Z	INFO	Wait completed, proceeding to shutdown the manager
2022-10-03T08:31:33.087Z	ERROR	setup	problem running manager	{"error": "failed to wait for rayjob caches to sync: timed out waiting for cache to be synced"}
main.main
	/workspace/main.go:121
runtime.main
	/usr/local/go/src/runtime/proc.go:255
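The error that repeats throughout the log names RayJob.ray.io ("no matches for kind RayJob in version ray.io/v1alpha1"), and the final error shows the manager exiting because the RayJob controller's cache never syncs. A quick check for whether that CRD exists at all (assuming the standard plural CRD name rayjobs.ray.io, matching the rayclusters.ray.io and rayservices.ray.io names shown further down in this thread):

$ kubectl get crd rayjobs.ray.io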

Some questions:

  1. Does monitoring of the Ray cluster's status stop when the kuberay-operator goes down? I noticed that while the kuberay-operator was down, I couldn't update the configuration of the Ray cluster, such as the image:tag in values.yaml.
  2. I just use the default values.yaml. Is this a configuration issue with the livenessProbe and readinessProbe? (A quick way to check is sketched right after this list.)
    livenessProbe:
      initialDelaySeconds: 10
      periodSeconds: 5
      failureThreshold: 5

    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 5
      failureThreshold: 5
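A quick way to tell whether the restarts come from failing probes or from the process itself exiting is to look at the container's last terminated state (pod name and namespace as above):

$ kubectl -n ray-system describe pod kuberay-operator-59d4ddc7f4-9kl6s
$ kubectl -n ray-system get pod kuberay-operator-59d4ddc7f4-9kl6s \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'

An exit code of 1 with reason Error (as in the pod status pasted further down in this thread) points at the manager process exiting on its own rather than the kubelet killing it for probe failures.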
    

I deployed the kuberay-operator with this command:

$ helm install kuberay-operator --namespace ray-system --create-namespace $(curl -s https://api.github.com/repos/ray-project/kuberay/releases/tags/v0.3.0 | grep '"browser_download_url":' | sort | grep -om1 'https.*helm-chart-kuberay-operator.*tgz')
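Note that Helm 3 installs CRDs bundled in a chart's crds/ directory only on the first install, silently skips any that already exist, and never touches them on upgrade, so leftovers from an earlier attempt can leave a newer CRD missing. A quick check for the three CRDs the v0.3.0 operator's controllers expect (CRD names assumed from the standard <plural>.<group> convention):

$ kubectl get crd rayclusters.ray.io rayjobs.ray.io rayservices.ray.io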

I deployed the ray-cluster after cloning the whole repository:

git clone git@github.com:ray-project/kuberay.git
cd kuberay/helm-chart/ray-cluster
helm install ray-cluster --namespace ray-system --create-namespace .

The commit is e77b0958b040de22c0eeb9320bff7cefed9ecd7b.

Have you deleted all resources (including CRDs and operator deployment) related to the legacy operator?

Also, is it possible you’re accidentally running multiple instances of the KubeRay operator?

Have you deleted all resources (including CRDs and operator deployment) related to the legacy operator?

hm…I did not.

$ kubectl get crd | grep ray
rayclusters.cluster.ray.io                                         2022-04-20T09:47:32Z
rayclusters.ray.io                                                 2022-09-30T08:15:58Z
rayservices.ray.io                                                 2022-09-30T08:16:00Z

Also, is it possible you’re accidentally running multiple instances of the KubeRay operator?

I checked: there is only one KubeRay operator in my K8s cluster.
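A minimal cross-namespace check for extra operator instances (this also catches a leftover legacy ray-operator deployment; the grep pattern is just a guess at the names):

$ kubectl get deployments -A | grep -i operator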

Hi @Dmitri,
I cleaned up the legacy operator and its CRD following the cleanup instructions:

$ kubectl -n ray delete raycluster ray-cluster
$ helm -n ray uninstall ray-cluster
$ kubectl delete namespace ray
$ kubectl delete crd rayclusters.cluster.ray.io

Only the two new CRDs remain:

$ kubectl get crd | grep ray
rayclusters.ray.io                                                 2022-09-30T08:15:58Z
rayservices.ray.io                                                 2022-09-30T08:16:00Z
$ kubectl get rayclusters.ray.io -A
NAMESPACE    NAME                  AGE
ray-system   ray-cluster-kuberay   3h53m

However, the kuberay-operator still keeps restarting.

NAME                                           READY   STATUS    RESTARTS   AGE
kuberay-operator-59d4ddc7f4-zb526              1/1     Running   7          28m
ray-cluster-kuberay-head-cqsff                 1/1     Running   0          26m
ray-cluster-kuberay-worker-workergroup-4nl5s   1/1     Running   0          26m

The logs seem to show the same error messages:

2022-10-05T09:22:49.760Z	INFO	setup	the operator	{"version:": ""}
2022-10-05T09:22:49.760Z	INFO	setup	Feature flag prioritize-workers-to-delete is enabled.
I1005 09:22:51.057793       1 request.go:665] Waited for 1.004207156s due to client-side throttling, not priority and fairness, request: GET:https://10.164.0.1:443/apis/catalog.cattle.io/v1?timeout=32s
2022-10-05T09:22:53.661Z	INFO	controller-runtime.metrics	Metrics server is starting to listen	{"addr": ":8080"}
2022-10-05T09:22:53.753Z	INFO	setup	starting manager
2022-10-05T09:22:53.753Z	INFO	Starting server	{"kind": "health probe", "addr": "[::]:8082"}
I1005 09:22:53.754158       1 leaderelection.go:248] attempting to acquire leader lease ray-system/ray-operator-leader...
2022-10-05T09:22:53.754Z	INFO	Starting server	{"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
I1005 09:23:10.565349       1 leaderelection.go:258] successfully acquired lease ray-system/ray-operator-leader
2022-10-05T09:23:10.565Z	DEBUG	events	Normal	{"object": {"kind":"ConfigMap","namespace":"ray-system","name":"ray-operator-leader","uid":"a4b18352-f4b4-41a1-a6f4-cf45b838663c","apiVersion":"v1","resourceVersion":"313373623"}, "reason": "LeaderElection", "message": "kuberay-operator-59d4ddc7f4-zb526_9c30208f-afc6-4c6a-9161-fff0df732714 became leader"}
2022-10-05T09:23:10.565Z	DEBUG	events	Normal	{"object": {"kind":"Lease","namespace":"ray-system","name":"ray-operator-leader","uid":"53e54feb-6db9-40da-812d-e2c9b776a722","apiVersion":"coordination.k8s.io/v1","resourceVersion":"313373624"}, "reason": "LeaderElection", "message": "kuberay-operator-59d4ddc7f4-zb526_9c30208f-afc6-4c6a-9161-fff0df732714 became leader"}
2022-10-05T09:23:10.565Z	INFO	controller.rayjob	Starting EventSource	{"reconciler group": "ray.io", "reconciler kind": "RayJob", "source": "kind source: *v1alpha1.RayJob"}
2022-10-05T09:23:10.566Z	INFO	controller.rayjob	Starting EventSource	{"reconciler group": "ray.io", "reconciler kind": "RayJob", "source": "kind source: *v1alpha1.RayCluster"}
2022-10-05T09:23:10.566Z	INFO	controller.rayjob	Starting EventSource	{"reconciler group": "ray.io", "reconciler kind": "RayJob", "source": "kind source: *v1.Service"}
2022-10-05T09:23:10.566Z	INFO	controller.rayjob	Starting Controller	{"reconciler group": "ray.io", "reconciler kind": "RayJob"}
2022-10-05T09:23:10.566Z	INFO	controller.rayservice	Starting EventSource	{"reconciler group": "ray.io", "reconciler kind": "RayService", "source": "kind source: *v1alpha1.RayService"}
2022-10-05T09:23:10.566Z	INFO	controller.rayservice	Starting EventSource	{"reconciler group": "ray.io", "reconciler kind": "RayService", "source": "kind source: *v1alpha1.RayCluster"}
2022-10-05T09:23:10.566Z	INFO	controller.rayservice	Starting EventSource	{"reconciler group": "ray.io", "reconciler kind": "RayService", "source": "kind source: *v1.Service"}
2022-10-05T09:23:10.567Z	INFO	controller.rayservice	Starting EventSource	{"reconciler group": "ray.io", "reconciler kind": "RayService", "source": "kind source: *v1.Ingress"}
2022-10-05T09:23:10.567Z	INFO	controller.rayservice	Starting Controller	{"reconciler group": "ray.io", "reconciler kind": "RayService"}
2022-10-05T09:23:10.565Z	INFO	controller.raycluster-controller	Starting EventSource	{"reconciler group": "ray.io", "reconciler kind": "RayCluster", "source": "kind source: *v1alpha1.RayCluster"}
2022-10-05T09:23:10.567Z	INFO	controller.raycluster-controller	Starting EventSource	{"reconciler group": "ray.io", "reconciler kind": "RayCluster", "source": "kind source: *v1.Event"}
2022-10-05T09:23:10.567Z	INFO	controller.raycluster-controller	Starting EventSource	{"reconciler group": "ray.io", "reconciler kind": "RayCluster", "source": "kind source: *v1.Pod"}
2022-10-05T09:23:10.567Z	INFO	controller.raycluster-controller	Starting EventSource	{"reconciler group": "ray.io", "reconciler kind": "RayCluster", "source": "kind source: *v1.Service"}
2022-10-05T09:23:10.567Z	INFO	controller.raycluster-controller	Starting Controller	{"reconciler group": "ray.io", "reconciler kind": "RayCluster"}
I1005 09:23:11.867852       1 request.go:665] Waited for 1.015266073s due to client-side throttling, not priority and fairness, request: GET:https://10.164.0.1:443/apis/cert-manager.io/v1beta1?timeout=32s
2022-10-05T09:23:14.058Z	ERROR	controller-runtime.source	if kind is a CRD, it should be installed before calling Start	{"kind": "RayJob.ray.io", "error": "no matches for kind \"RayJob\" in version \"ray.io/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.poll
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:580
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:131
2022-10-05T09:23:15.662Z	INFO	controller.rayservice	Starting workers	{"reconciler group": "ray.io", "reconciler kind": "RayService", "worker count": 1}
2022-10-05T09:23:15.752Z	INFO	controller.raycluster-controller	Starting workers	{"reconciler group": "ray.io", "reconciler kind": "RayCluster", "worker count": 1}
2022-10-05T09:23:15.753Z	INFO	controllers.RayCluster	reconciling RayCluster	{"cluster name": "ray-cluster-kuberay"}
2022-10-05T09:23:15.753Z	INFO	controllers.RayCluster	reconcileServices 	{"headService service found": "ray-cluster-kuberay-head-svc"}
2022-10-05T09:23:15.753Z	INFO	controllers.RayCluster	reconcilePods 	{"head pod found": "ray-cluster-kuberay-head-cqsff"}
2022-10-05T09:23:15.754Z	INFO	controllers.RayCluster	reconcilePods	{"head pod is up and running... checking workers": "ray-cluster-kuberay-head-cqsff"}
2022-10-05T09:23:15.754Z	INFO	controllers.RayCluster	reconcilePods	{"removing the pods in the scaleStrategy of": "workergroup"}
2022-10-05T09:23:15.754Z	INFO	controllers.RayCluster	reconcilePods	{"all workers already exist for group": "workergroup"}
2022-10-05T09:23:16.056Z	INFO	controllers.RayCluster	reconcile RayCluster Event	{"event name": "kuberay-operator-59d4ddc7f4-zb526.171b208c63d06a42"}
2022-10-05T09:23:16.057Z	INFO	controllers.RayCluster	FT not enabled skipping event reconcile for pod.	{"pod name": "ray-cluster-kuberay-worker-workergroup-4nl5s"}
2022-10-05T09:23:16.152Z	INFO	controllers.RayCluster	reconcile RayCluster Event	{"event name": "gke-metadata-server-smq48.171b1fbaa00839f6"}
2022-10-05T09:23:16.155Z	INFO	controllers.RayCluster	pod not found or no valid annotations	{"pod name": "gke-metadata-server-smq48"}
2022-10-05T09:23:16.159Z	INFO	controllers.RayCluster	reconcile RayCluster Event	{"event name": "gke-metadata-server-bmxzm.171b20511163c188"}
2022-10-05T09:23:16.252Z	INFO	controllers.RayCluster	pod not found or no valid annotations	{"pod name": "gke-metadata-server-bmxzm"}
2022-10-05T09:23:16.252Z	INFO	controllers.RayCluster	reconciling RayCluster	{"cluster name": "ray-cluster-kuberay"}
2022-10-05T09:23:16.252Z	INFO	controllers.RayCluster	reconcileServices 	{"headService service found": "ray-cluster-kuberay-head-svc"}
2022-10-05T09:23:16.252Z	INFO	controllers.RayCluster	reconcilePods 	{"head pod found": "ray-cluster-kuberay-head-cqsff"}
2022-10-05T09:23:16.253Z	INFO	controllers.RayCluster	reconcilePods	{"head pod is up and running... checking workers": "ray-cluster-kuberay-head-cqsff"}
2022-10-05T09:23:16.253Z	INFO	controllers.RayCluster	reconcilePods	{"removing the pods in the scaleStrategy of": "workergroup"}
2022-10-05T09:23:16.253Z	INFO	controllers.RayCluster	reconcilePods	{"all workers already exist for group": "workergroup"}
2022-10-05T09:23:16.352Z	INFO	controllers.RayCluster	reconciling RayCluster	{"cluster name": "ray-cluster-kuberay"}
2022-10-05T09:23:16.352Z	INFO	controllers.RayCluster	reconcileServices 	{"headService service found": "ray-cluster-kuberay-head-svc"}
2022-10-05T09:23:16.352Z	INFO	controllers.RayCluster	reconcilePods 	{"head pod found": "ray-cluster-kuberay-head-cqsff"}
2022-10-05T09:23:16.352Z	INFO	controllers.RayCluster	reconcilePods	{"head pod is up and running... checking workers": "ray-cluster-kuberay-head-cqsff"}
2022-10-05T09:23:16.352Z	INFO	controllers.RayCluster	reconcilePods	{"removing the pods in the scaleStrategy of": "workergroup"}
2022-10-05T09:23:16.352Z	INFO	controllers.RayCluster	reconcilePods	{"all workers already exist for group": "workergroup"}
I1005 09:23:25.110622       1 request.go:665] Waited for 1.043901258s due to client-side throttling, not priority and fairness, request: GET:https://10.164.0.1:443/apis/install.istio.io/v1alpha1?timeout=32s
2022-10-05T09:23:27.955Z	ERROR	controller-runtime.source	if kind is a CRD, it should be installed before calling Start	{"kind": "RayJob.ray.io", "error": "no matches for kind \"RayJob\" in version \"ray.io/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:131
I1005 09:23:35.160627       1 request.go:665] Waited for 1.091405043s due to client-side throttling, not priority and fairness, request: GET:https://10.164.0.1:443/apis/acme.cert-manager.io/v1alpha2?timeout=32s
2022-10-05T09:23:37.854Z	ERROR	controller-runtime.source	if kind is a CRD, it should be installed before calling Start	{"kind": "RayJob.ray.io", "error": "no matches for kind \"RayJob\" in version \"ray.io/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:131
I1005 09:23:45.161359       1 request.go:665] Waited for 1.093494702s due to client-side throttling, not priority and fairness, request: GET:https://10.164.0.1:443/apis/networking.internal.knative.dev/v1alpha1?timeout=32s
2022-10-05T09:23:47.514Z	ERROR	controller-runtime.source	if kind is a CRD, it should be installed before calling Start	{"kind": "RayJob.ray.io", "error": "no matches for kind \"RayJob\" in version \"ray.io/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:131
I1005 09:23:55.210113       1 request.go:665] Waited for 1.143400815s due to client-side throttling, not priority and fairness, request: GET:https://10.164.0.1:443/apis/storage.k8s.io/v1?timeout=32s
2022-10-05T09:23:57.515Z	ERROR	controller-runtime.source	if kind is a CRD, it should be installed before calling Start	{"kind": "RayJob.ray.io", "error": "no matches for kind \"RayJob\" in version \"ray.io/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:131
I1005 09:24:05.210292       1 request.go:665] Waited for 1.142955085s due to client-side throttling, not priority and fairness, request: GET:https://10.164.0.1:443/apis/monitoring.coreos.com/v1?timeout=32s
2022-10-05T09:24:08.062Z	ERROR	controller-runtime.source	if kind is a CRD, it should be installed before calling Start	{"kind": "RayJob.ray.io", "error": "no matches for kind \"RayJob\" in version \"ray.io/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:131
I1005 09:24:15.210340       1 request.go:665] Waited for 1.144275086s due to client-side throttling, not priority and fairness, request: GET:https://10.164.0.1:443/apis/autoscaling.k8s.io/v1?timeout=32s
2022-10-05T09:24:17.514Z	ERROR	controller-runtime.source	if kind is a CRD, it should be installed before calling Start	{"kind": "RayJob.ray.io", "error": "no matches for kind \"RayJob\" in version \"ray.io/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:131
I1005 09:24:25.217501       1 request.go:665] Waited for 1.149285606s due to client-side throttling, not priority and fairness, request: GET:https://10.164.0.1:443/apis/networking.internal.knative.dev/v1alpha1?timeout=32s
2022-10-05T09:24:27.751Z	ERROR	controller-runtime.source	if kind is a CRD, it should be installed before calling Start	{"kind": "RayJob.ray.io", "error": "no matches for kind \"RayJob\" in version \"ray.io/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:131
I1005 09:24:35.260409       1 request.go:665] Waited for 1.191728788s due to client-side throttling, not priority and fairness, request: GET:https://10.164.0.1:443/apis/cert-manager.io/v1alpha2?timeout=32s
2022-10-05T09:24:37.655Z	ERROR	controller-runtime.source	if kind is a CRD, it should be installed before calling Start	{"kind": "RayJob.ray.io", "error": "no matches for kind \"RayJob\" in version \"ray.io/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:131
I1005 09:24:45.264774       1 request.go:665] Waited for 1.193090043s due to client-side throttling, not priority and fairness, request: GET:https://10.164.0.1:443/apis/telemetry.istio.io/v1alpha1?timeout=32s
2022-10-05T09:24:47.554Z	ERROR	controller-runtime.source	if kind is a CRD, it should be installed before calling Start	{"kind": "RayJob.ray.io", "error": "no matches for kind \"RayJob\" in version \"ray.io/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:131
I1005 09:24:55.310852       1 request.go:665] Waited for 1.241917404s due to client-side throttling, not priority and fairness, request: GET:https://10.164.0.1:443/apis/discovery.k8s.io/v1?timeout=32s
2022-10-05T09:24:57.516Z	ERROR	controller-runtime.source	if kind is a CRD, it should be installed before calling Start	{"kind": "RayJob.ray.io", "error": "no matches for kind \"RayJob\" in version \"ray.io/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:131
I1005 09:25:05.359533       1 request.go:665] Waited for 1.291642158s due to client-side throttling, not priority and fairness, request: GET:https://10.164.0.1:443/apis/snapshot.storage.k8s.io/v1?timeout=32s
2022-10-05T09:25:07.854Z	ERROR	controller-runtime.source	if kind is a CRD, it should be installed before calling Start	{"kind": "RayJob.ray.io", "error": "no matches for kind \"RayJob\" in version \"ray.io/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
	/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:131
2022-10-05T09:25:10.567Z	ERROR	controller.rayjob	Could not wait for Cache to sync	{"reconciler group": "ray.io", "reconciler kind": "RayJob", "error": "failed to wait for rayjob caches to sync: timed out waiting for cache to be synced"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:208
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:234
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/manager/runnable_group.go:218
2022-10-05T09:25:10.568Z	INFO	Stopping and waiting for non leader election runnables
2022-10-05T09:25:10.568Z	INFO	Stopping and waiting for leader election runnables
2022-10-05T09:25:10.568Z	INFO	controller.raycluster-controller	Shutdown signal received, waiting for all workers to finish	{"reconciler group": "ray.io", "reconciler kind": "RayCluster"}
2022-10-05T09:25:10.568Z	INFO	controller.rayservice	Shutdown signal received, waiting for all workers to finish	{"reconciler group": "ray.io", "reconciler kind": "RayService"}
2022-10-05T09:25:10.568Z	INFO	controller.rayservice	All workers finished	{"reconciler group": "ray.io", "reconciler kind": "RayService"}
2022-10-05T09:25:10.568Z	INFO	controller.raycluster-controller	All workers finished	{"reconciler group": "ray.io", "reconciler kind": "RayCluster"}
2022-10-05T09:25:10.568Z	INFO	Stopping and waiting for caches
2022-10-05T09:25:10.569Z	INFO	Stopping and waiting for webhooks
2022-10-05T09:25:10.569Z	INFO	Wait completed, proceeding to shutdown the manager
2022-10-05T09:25:10.569Z	ERROR	setup	problem running manager	{"error": "failed to wait for rayjob caches to sync: timed out waiting for cache to be synced"}
main.main
	/workspace/main.go:121
runtime.main
	/usr/local/go/src/runtime/proc.go:255
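The repeated error is still about RayJob.ray.io, and the kubectl get crd output above lists rayclusters.ray.io and rayservices.ray.io but no rayjobs.ray.io, so the RayJob controller can never sync its cache and the manager gives up and exits. One possible way to install just the missing CRD from the already-cloned repository (the path is assumed from the repo's standard kubebuilder layout; verify it exists at the checked-out commit before applying):

$ kubectl create -f kuberay/ray-operator/config/crd/bases/ray.io_rayjobs.yaml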

While the kuberay-operator was in the restart loop, I could not update ray-cluster-kuberay-head-xxx by changing values.yaml. For example, I changed num-cpus from 1 to 0:

# andrew-values.yaml
head:
  initArgs:
    num-cpus: '0'

Then I tried to use the following command to update the Ray head pod, but nothing happened:

$ helm upgrade ray-cluster . -n ray-system -f andrew-values.yaml
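A quick check of whether the upgrade at least reached the RayCluster custom resource (resource and release names as above; the grep just looks for the changed parameter):

$ kubectl -n ray-system get raycluster ray-cluster-kuberay -o yaml | grep -B 2 -A 2 num-cpus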

The workaround I used was to delete the release and reinstall it:

$ helm delete ray-cluster -n ray-system
$ helm install ray-cluster . -n ray-system -f andrew-values.yaml

System info:

GKE version: 1.21.14-gke.2700
Python version: 3.8
Ray version: 2.0
KubeRay operator version: 0.3

More information about the pod and deployment:

Pod

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2022-10-05T08:55:14Z"
  generateName: kuberay-operator-59d4ddc7f4-
  labels:
    app.kubernetes.io/instance: kuberay-operator
    app.kubernetes.io/name: kuberay-operator
    pod-template-hash: 59d4ddc7f4
  name: kuberay-operator-59d4ddc7f4-zb526
  namespace: ray-system
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: kuberay-operator-59d4ddc7f4
    uid: 67db665d-3ed8-415f-95af-42e22613a0dd
  resourceVersion: "314941902"
  uid: a3f302c1-2a5d-475c-8d65-73268c21b414
spec:
  containers:
  - command:
    - /manager
    image: kuberay/operator:v0.3.0
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 5
      httpGet:
        path: /metrics
        port: http
        scheme: HTTP
      initialDelaySeconds: 10
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 1
    name: kuberay-operator
    ports:
    - containerPort: 8080
      name: http
      protocol: TCP
    readinessProbe:
      failureThreshold: 5
      httpGet:
        path: /metrics
        port: http
        scheme: HTTP
      initialDelaySeconds: 10
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 1
    resources:
      limits:
        cpu: 100m
        memory: 128Mi
      requests:
        cpu: 100m
        memory: 128Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-jd5qn
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: gke-flow-nap-e2-highcpu-32-1j4v45cx-2d559bfc-lsct
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: gke.io/optimize-utilization-scheduler
  securityContext: {}
  serviceAccount: kuberay-operator
  serviceAccountName: kuberay-operator
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: kube-api-access-jd5qn
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-10-05T08:55:14Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-10-06T03:48:52Z"
    message: 'containers with unready status: [kuberay-operator]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2022-10-06T03:48:52Z"
    message: 'containers with unready status: [kuberay-operator]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2022-10-05T08:55:14Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://0f1b5cf81f9e3fa982daf65d4a9ba4a86a706f673578736c2f3684d94fe702bf
    image: docker.io/kuberay/operator:v0.3.0
    imageID: docker.io/kuberay/operator@sha256:a3d78f17dd16afa6dfa2e96a7ce10d537b98b956358af98c5cfe248f86a2066d
    lastState:
      terminated:
        containerID: containerd://0f1b5cf81f9e3fa982daf65d4a9ba4a86a706f673578736c2f3684d94fe702bf
        exitCode: 1
        finishedAt: "2022-10-06T03:48:51Z"
        reason: Error
        startedAt: "2022-10-06T03:46:29Z"
    name: kuberay-operator
    ready: false
    restartCount: 155
    started: false
    state:
      waiting:
        message: back-off 5m0s restarting failed container=kuberay-operator pod=kuberay-operator-59d4ddc7f4-zb526_ray-system(a3f302c1-2a5d-475c-8d65-73268c21b414)
        reason: CrashLoopBackOff
  hostIP: 10.0.10.14
  phase: Running
  podIP: 10.160.7.38
  podIPs:
  - ip: 10.160.7.38
  qosClass: Guaranteed
  startTime: "2022-10-05T08:55:14Z"

Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
    meta.helm.sh/release-name: kuberay-operator
    meta.helm.sh/release-namespace: ray-system
  creationTimestamp: "2022-10-05T08:55:14Z"
  generation: 1
  labels:
    app.kubernetes.io/instance: kuberay-operator
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: kuberay-operator
    app.kubernetes.io/version: "1.0"
    helm.sh/chart: kuberay-operator-0.3.0
  name: kuberay-operator
  namespace: ray-system
  resourceVersion: "314938654"
  uid: 5b38f28a-eaa4-44ba-a46d-c08d92c2d13f
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: kuberay-operator
      app.kubernetes.io/name: kuberay-operator
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: kuberay-operator
        app.kubernetes.io/name: kuberay-operator
    spec:
      containers:
      - command:
        - /manager
        image: kuberay/operator:v0.3.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /metrics
            port: http
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        name: kuberay-operator
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 5
          httpGet:
            path: /metrics
            port: http
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: 100m
            memory: 128Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: kuberay-operator
      serviceAccountName: kuberay-operator
      terminationGracePeriodSeconds: 30
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2022-10-05T08:55:14Z"
    lastUpdateTime: "2022-10-05T08:55:39Z"
    message: ReplicaSet "kuberay-operator-59d4ddc7f4" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2022-10-06T03:46:39Z"
    lastUpdateTime: "2022-10-06T03:46:39Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 1
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1
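
For reference, dumps like the above (and the crash details) can be reproduced with standard kubectl commands; the pod name below is the one from my cluster:

$ kubectl get pod kuberay-operator-59d4ddc7f4-zb526 -n ray-system -o yaml
$ kubectl get deployment kuberay-operator -n ray-system -o yaml

# logs of the previous (crashed) container instance
$ kubectl logs kuberay-operator-59d4ddc7f4-zb526 -n ray-system --previous

# restart / back-off events
$ kubectl describe pod kuberay-operator-59d4ddc7f4-zb526 -n ray-system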

Hi @Dmitri ,

The kuberay-operator-xxx pod keeps restarting. Is it possibly related to RayJob?

I saw many RayJob errors in the log of the kuberay-operator-xxx pod, but I don’t have this CRD in my K8s cluster (a quick check for the CRD is shown after the log below).

2022-10-05T09:25:10.567Z	ERROR	controller.rayjob	Could not wait for Cache to sync	{"reconciler group": "ray.io", "reconciler kind": "RayJob", "error": "failed to wait for rayjob caches to sync: timed out waiting for cache to be synced"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:208
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:234
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/manager/runnable_group.go:218
2022-10-05T09:25:10.569Z	ERROR	setup	problem running manager	{"error": "failed to wait for rayjob caches to sync: timed out waiting for cache to be synced"}
main.main
	/workspace/main.go:121
runtime.main
	/usr/local/go/src/runtime/proc.go:255
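
For anyone hitting the same thing, a quick way to confirm whether the RayJob CRD actually exists in the cluster (these are generic kubectl checks, not KubeRay-specific):

$ kubectl get crd rayjobs.ray.io
# if the CRD is missing, kubectl returns a NotFound error here

$ kubectl api-resources --api-group=ray.io
# lists which ray.io kinds the API server actually knows about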

I think we figured out what’s going on.

I used the helm install command to deploy KubeRay v0.3.0.

$ helm install kuberay-operator --namespace ray-system --create-namespace $(curl -s https://api.github.com/repos/ray-project/kuberay/releases/tags/v0.3.0 | grep '"browser_download_url":' | sort | grep -om1 'https.*helm-chart-kuberay-operator.*tgz')

It turns out the crds directory of the Helm chart is missing the ray.io_rayjobs.yaml file. So I copied ray.io_rayjobs.yaml from ray-operator/config/crd/bases/ into the Helm chart and reinstalled it on K8s; the rough steps are sketched below. The kuberay-operator-xxx pod is not restarting anymore.
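
Roughly, the fix looked like this. The paths assume the chart tarball was unpacked into ./kuberay-operator and the KubeRay repo was cloned next to it, so treat it as a sketch rather than an exact recipe:

# copy the missing CRD from the KubeRay repo into the chart's crds/ directory
$ cp kuberay/ray-operator/config/crd/bases/ray.io_rayjobs.yaml kuberay-operator/crds/

# Helm 3 only applies files under crds/ at install time, so reinstall instead of upgrading
$ helm uninstall kuberay-operator -n ray-system
$ helm install kuberay-operator ./kuberay-operator -n ray-system

# alternatively, the CRD file can be applied directly without touching the chart
$ kubectl apply -f kuberay/ray-operator/config/crd/bases/ray.io_rayjobs.yaml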

I see that the ray.io_rayjobs.yaml file is already in the master branch, so I guess this won’t be an issue in future versions.


Besides, the first time I deployed the KubeRay operator I tried to use the kubectl create command:

$ kubectl create -k "github.com/ray-project/kuberay/ray-operator/config/crd?ref=v0.3.0&timeout=90s"

However, this command didn’t work; it always timed out for me. Maybe the link is broken.

We’ve indeed fixed the issue with the missing RayJob CRD. (We also added CI validation that CRDs are in sync.)

The timeout issue is surprising. I will try it out again. Internally, I believe the kubectl create -k command pulls the KubeRay git repo. One idea is to extend the timeout further. Another (inconvenient) alternative is to first pull the 0.3.0 branch of the repo and then run kubectl create -k kuberay/ray-operator/config/crd. Both options are sketched below.
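
Roughly, the two options would look something like this (the version tag is assumed to be v0.3.0 and the timeout value is arbitrary):

# option 1: give kustomize more time to fetch the remote base
$ kubectl create -k "github.com/ray-project/kuberay/ray-operator/config/crd?ref=v0.3.0&timeout=300s"

# option 2: clone the repo first, then create the CRDs from the local checkout
$ git clone --depth 1 --branch v0.3.0 https://github.com/ray-project/kuberay.git
$ kubectl create -k kuberay/ray-operator/config/crd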

To enable upgrading pod configuration with KubeRay, you can add the flag --forced-cluster-upgrade to the operator deployment’s entrypoint.
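
For example, one way to append the flag to an already-deployed operator; this is just a sketch of the mechanics, and you could equally edit the Deployment manifest or the Helm chart directly:

# append --forced-cluster-upgrade to the /manager command of the operator container
$ kubectl patch deployment kuberay-operator -n ray-system --type=json \
    -p='[{"op": "add", "path": "/spec/template/spec/containers/0/command/-", "value": "--forced-cluster-upgrade"}]'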

The flag is experimental. In particular, there are currently some limitations that prevent using this flag in K8s environments that mutate pod resource configuration: [Bug] --forced-cluster-upgrade Causes termination loop for ray head node · Issue #558 · ray-project/kuberay · GitHub

More context on cluster config upgrades here: [Feature] rolling upgrade design and implementation for Kuberay · Issue #527 · ray-project/kuberay · GitHub

Thanks Dmitri,

I’ll check the information you provided.