How severely does this issue affect your experience of using Ray?
- Medium: It causes significant difficulty in completing my task, but I can work around it.
Hi Ray Team,

I started migrating my Ray cluster from the legacy K8s operator to KubeRay, which I deployed via the Helm chart.
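For reference, the install was essentially the standard Helm steps; the release names below are illustrative assumptions, though the `ray-system` namespace matches the leader-election logs further down:

```bash
# Sketch of the install, assuming the standard KubeRay Helm repo;
# release names are illustrative, not copied from my environment.
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update

# Operator into the ray-system namespace (namespace taken from the leader-election logs)
helm install kuberay-operator kuberay/kuberay-operator -n ray-system --create-namespace

# A RayCluster using the chart defaults
helm install ray-cluster kuberay/ray-cluster -n ray-system
```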
Shortly after the install, I saw a strange situation in which the `kuberay-operator` pod kept restarting within a short period:

```
NAME                                           READY   STATUS    RESTARTS   AGE
kuberay-operator-59d4ddc7f4-9kl6s              1/1     Running   7          28m
ray-cluster-kuberay-head-xrqh2                 1/1     Running   0          28m
ray-cluster-kuberay-worker-workergroup-4wf2s   1/1     Running   0          28m
```
As the output above shows, the `kuberay-operator` restarted 7 times in 28 minutes… When the `kuberay-operator` went down, its status showed `CrashLoopBackOff`:
```
NAME                                READY   STATUS             RESTARTS   AGE
kuberay-operator-59d4ddc7f4-9kl6s   0/1     CrashLoopBackOff   9          46m
```
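To capture the crash output, I pulled the logs from the previous (crashed) container instance; commands along these lines work (the label selector is an assumption based on standard chart labels):

```bash
# Logs from the last crashed instance of the operator
kubectl -n ray-system logs --previous deploy/kuberay-operator

# Restart reason / last state of the container
# (label selector assumed from standard Helm chart conventions)
kubectl -n ray-system describe pod -l app.kubernetes.io/name=kuberay-operator
```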
Here are the logs I captured when it went down:

```
2022-10-03T08:29:11.092Z INFO setup the operator {"version:": ""}
2022-10-03T08:29:11.092Z INFO setup Feature flag prioritize-workers-to-delete is enabled.
I1003 08:29:12.193426 1 request.go:665] Waited for 1.003580509s due to client-side throttling, not priority and fairness, request: GET:https://10.164.0.1:443/apis/cert-manager.io/v1beta1?timeout=32s
2022-10-03T08:29:14.990Z INFO controller-runtime.metrics Metrics server is starting to listen {"addr": ":8080"}
2022-10-03T08:29:14.990Z INFO setup starting manager
2022-10-03T08:29:14.991Z INFO Starting server {"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
2022-10-03T08:29:14.991Z INFO Starting server {"kind": "health probe", "addr": "[::]:8082"}
I1003 08:29:14.991246 1 leaderelection.go:248] attempting to acquire leader lease ray-system/ray-operator-leader...
I1003 08:29:33.085800 1 leaderelection.go:258] successfully acquired lease ray-system/ray-operator-leader
2022-10-03T08:29:33.085Z DEBUG events Normal {"object": {"kind":"ConfigMap","namespace":"ray-system","name":"ray-operator-leader","uid":"a4b18352-f4b4-41a1-a6f4-cf45b838663c","apiVersion":"v1","resourceVersion":"309279038"}, "reason": "LeaderElection", "message": "kuberay-operator-59d4ddc7f4-9kl6s_b7d9dbfe-a103-4600-94db-fc64e5931119 became leader"}
2022-10-03T08:29:33.086Z DEBUG events Normal {"object": {"kind":"Lease","namespace":"ray-system","name":"ray-operator-leader","uid":"53e54feb-6db9-40da-812d-e2c9b776a722","apiVersion":"coordination.k8s.io/v1","resourceVersion":"309279040"}, "reason": "LeaderElection", "message": "kuberay-operator-59d4ddc7f4-9kl6s_b7d9dbfe-a103-4600-94db-fc64e5931119 became leader"}
2022-10-03T08:29:33.086Z INFO controller.raycluster-controller Starting EventSource {"reconciler group": "ray.io", "reconciler kind": "RayCluster", "source": "kind source: *v1alpha1.RayCluster"}
2022-10-03T08:29:33.086Z INFO controller.raycluster-controller Starting EventSource {"reconciler group": "ray.io", "reconciler kind": "RayCluster", "source": "kind source: *v1.Event"}
2022-10-03T08:29:33.086Z INFO controller.raycluster-controller Starting EventSource {"reconciler group": "ray.io", "reconciler kind": "RayCluster", "source": "kind source: *v1.Pod"}
2022-10-03T08:29:33.086Z INFO controller.raycluster-controller Starting EventSource {"reconciler group": "ray.io", "reconciler kind": "RayCluster", "source": "kind source: *v1.Service"}
2022-10-03T08:29:33.086Z INFO controller.raycluster-controller Starting Controller {"reconciler group": "ray.io", "reconciler kind": "RayCluster"}
2022-10-03T08:29:33.086Z INFO controller.rayjob Starting EventSource {"reconciler group": "ray.io", "reconciler kind": "RayJob", "source": "kind source: *v1alpha1.RayJob"}
2022-10-03T08:29:33.086Z INFO controller.rayjob Starting EventSource {"reconciler group": "ray.io", "reconciler kind": "RayJob", "source": "kind source: *v1alpha1.RayCluster"}
2022-10-03T08:29:33.086Z INFO controller.rayjob Starting EventSource {"reconciler group": "ray.io", "reconciler kind": "RayJob", "source": "kind source: *v1.Service"}
2022-10-03T08:29:33.086Z INFO controller.rayjob Starting Controller {"reconciler group": "ray.io", "reconciler kind": "RayJob"}
2022-10-03T08:29:33.087Z INFO controller.rayservice Starting EventSource {"reconciler group": "ray.io", "reconciler kind": "RayService", "source": "kind source: *v1alpha1.RayService"}
2022-10-03T08:29:33.087Z INFO controller.rayservice Starting EventSource {"reconciler group": "ray.io", "reconciler kind": "RayService", "source": "kind source: *v1alpha1.RayCluster"}
2022-10-03T08:29:33.087Z INFO controller.rayservice Starting EventSource {"reconciler group": "ray.io", "reconciler kind": "RayService", "source": "kind source: *v1.Service"}
2022-10-03T08:29:33.087Z INFO controller.rayservice Starting EventSource {"reconciler group": "ray.io", "reconciler kind": "RayService", "source": "kind source: *v1.Ingress"}
2022-10-03T08:29:33.087Z INFO controller.rayservice Starting Controller {"reconciler group": "ray.io", "reconciler kind": "RayService"}
I1003 08:29:33.689070 1 trace.go:205] Trace[610244577]: "DeltaFIFO Pop Process" ID:mark-lee/nogpu,Depth:36,Reason:slow event handlers blocking the queue (03-Oct-2022 08:29:33.491) (total time: 196ms):
Trace[610244577]: [196.950657ms] [196.950657ms] END
I1003 08:29:34.344988 1 request.go:665] Waited for 1.048843374s due to client-side throttling, not priority and fairness, request: GET:https://10.164.0.1:443/apis/coordination.k8s.io/v1beta1?timeout=32s
2022-10-03T08:29:36.698Z ERROR controller-runtime.source if kind is a CRD, it should be installed before calling Start {"kind": "RayJob.ray.io", "error": "no matches for kind \"RayJob\" in version \"ray.io/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.poll
/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:580
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:131
2022-10-03T08:29:36.699Z INFO controller.raycluster-controller Starting workers {"reconciler group": "ray.io", "reconciler kind": "RayCluster", "worker count": 1}
2022-10-03T08:29:36.699Z INFO controllers.RayCluster reconciling RayCluster {"cluster name": "ray-cluster-kuberay"}
2022-10-03T08:29:36.699Z INFO controllers.RayCluster reconcileServices {"headService service found": "ray-cluster-kuberay-head-svc"}
2022-10-03T08:29:36.699Z INFO controllers.RayCluster reconcilePods {"head pod found": "ray-cluster-kuberay-head-xrqh2"}
2022-10-03T08:29:36.699Z INFO controllers.RayCluster reconcilePods {"head pod is up and running... checking workers": "ray-cluster-kuberay-head-xrqh2"}
2022-10-03T08:29:36.699Z INFO controllers.RayCluster reconcilePods {"removing the pods in the scaleStrategy of": "workergroup"}
2022-10-03T08:29:36.699Z INFO controllers.RayCluster reconcilePods {"all workers already exist for group": "workergroup"}
2022-10-03T08:29:36.712Z INFO controllers.RayCluster reconcile RayCluster Event {"event name": "kuberay-operator-59d4ddc7f4-9kl6s.171a80749ff68a31"}
2022-10-03T08:29:36.713Z INFO controllers.RayCluster FT not enabled skipping event reconcile for pod. {"pod name": "ray-cluster-kuberay-head-xrqh2"}
2022-10-03T08:29:36.713Z INFO controllers.RayCluster reconciling RayCluster {"cluster name": "ray-cluster-kuberay"}
2022-10-03T08:29:36.713Z INFO controllers.RayCluster reconcileServices {"headService service found": "ray-cluster-kuberay-head-svc"}
2022-10-03T08:29:36.713Z INFO controllers.RayCluster reconcilePods {"head pod found": "ray-cluster-kuberay-head-xrqh2"}
2022-10-03T08:29:36.713Z INFO controllers.RayCluster reconcilePods {"head pod is up and running... checking workers": "ray-cluster-kuberay-head-xrqh2"}
2022-10-03T08:29:36.713Z INFO controllers.RayCluster reconcilePods {"removing the pods in the scaleStrategy of": "workergroup"}
2022-10-03T08:29:36.713Z INFO controllers.RayCluster reconcilePods {"all workers already exist for group": "workergroup"}
2022-10-03T08:29:36.796Z ERROR controllers.RayCluster Update status error {"cluster name": "ray-cluster-kuberay", "error": "Operation cannot be fulfilled on rayclusters.ray.io \"ray-cluster-kuberay\": the object has been modified; please apply your changes to the latest version and try again"}
github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayClusterReconciler).Reconcile
/workspace/controllers/ray/raycluster_controller.go:95
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:227
2022-10-03T08:29:36.797Z INFO controllers.RayCluster reconciling RayCluster {"cluster name": "ray-cluster-kuberay"}
2022-10-03T08:29:36.797Z INFO controllers.RayCluster reconcileServices {"headService service found": "ray-cluster-kuberay-head-svc"}
2022-10-03T08:29:36.797Z INFO controllers.RayCluster reconcilePods {"head pod found": "ray-cluster-kuberay-head-xrqh2"}
2022-10-03T08:29:36.797Z INFO controllers.RayCluster reconcilePods {"head pod is up and running... checking workers": "ray-cluster-kuberay-head-xrqh2"}
2022-10-03T08:29:36.797Z INFO controllers.RayCluster reconcilePods {"removing the pods in the scaleStrategy of": "workergroup"}
2022-10-03T08:29:36.797Z INFO controllers.RayCluster reconcilePods {"all workers already exist for group": "workergroup"}
2022-10-03T08:29:36.800Z INFO controller.rayservice Starting workers {"reconciler group": "ray.io", "reconciler kind": "RayService", "worker count": 1}
I1003 08:29:47.750384 1 request.go:665] Waited for 1.042937768s due to client-side throttling, not priority and fairness, request: GET:https://10.164.0.1:443/apis/snapshot.storage.k8s.io/v1?timeout=32s
2022-10-03T08:29:50.204Z ERROR controller-runtime.source if kind is a CRD, it should be installed before calling Start {"kind": "RayJob.ray.io", "error": "no matches for kind \"RayJob\" in version \"ray.io/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
/go/pkg/mod/k8s.io/apimachinery@v0.23.0/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/source/source.go:131
[... the client-side throttling warnings (cycling through other API groups: cert-manager.io, install.istio.io, operators.coreos.com, management.cattle.io, etc.) and the identical "RayJob" CRD error with the same stack trace repeat roughly every 10 seconds until 08:31:30 ...]
2022-10-03T08:31:33.087Z ERROR controller.rayjob Could not wait for Cache to sync {"reconciler group": "ray.io", "reconciler kind": "RayJob", "error": "failed to wait for rayjob caches to sync: timed out waiting for cache to be synced"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:208
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:234
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/manager/runnable_group.go:218
2022-10-03T08:31:33.087Z INFO Stopping and waiting for non leader election runnables
2022-10-03T08:31:33.087Z INFO Stopping and waiting for leader election runnables
2022-10-03T08:31:33.087Z INFO controller.rayservice Shutdown signal received, waiting for all workers to finish {"reconciler group": "ray.io", "reconciler kind": "RayService"}
2022-10-03T08:31:33.087Z INFO controller.raycluster-controller Shutdown signal received, waiting for all workers to finish {"reconciler group": "ray.io", "reconciler kind": "RayCluster"}
2022-10-03T08:31:33.087Z INFO controller.rayservice All workers finished {"reconciler group": "ray.io", "reconciler kind": "RayService"}
2022-10-03T08:31:33.087Z INFO controller.raycluster-controller All workers finished {"reconciler group": "ray.io", "reconciler kind": "RayCluster"}
2022-10-03T08:31:33.087Z INFO Stopping and waiting for caches
2022-10-03T08:31:33.087Z INFO Stopping and waiting for webhooks
2022-10-03T08:31:33.087Z INFO Wait completed, proceeding to shutdown the manager
2022-10-03T08:31:33.087Z ERROR setup problem running manager {"error": "failed to wait for rayjob caches to sync: timed out waiting for cache to be synced"}
main.main
/workspace/main.go:121
runtime.main
/usr/local/go/src/runtime/proc.go:255
```
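One observation before my questions: the recurring `no matches for kind "RayJob" in version "ray.io/v1alpha1"` error suggests the RayJob CRD may not be installed in my cluster. A quick way to verify (sketch):

```bash
# List the Ray CRDs the API server knows about
kubectl get crd | grep ray.io

# Or ask for the resources registered in the ray.io API group
kubectl api-resources --api-group=ray.io
```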
Some questions:

- Does the operator stop monitoring the status of the Ray cluster when the `kuberay-operator` goes down? I noticed that while the `kuberay-operator` was down, I couldn't update the configuration of the Ray cluster, such as the `image.tag` in `values.yaml` (see the upgrade sketch after this list).
- I just use the default `values.yaml`. Is this a configuration issue with the `livenessProbe` and `readinessProbe`?

  ```yaml
  livenessProbe:
    initialDelaySeconds: 10
    periodSeconds: 5
    failureThreshold: 5
  readinessProbe:
    initialDelaySeconds: 10
    periodSeconds: 5
    failureThreshold: 5
  ```
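A hypothetical example of the kind of configuration update I mean in the first question; the release name and tag value here are placeholders, not what I actually ran:

```bash
# Hypothetical upgrade of the RayCluster release with a new image tag;
# "ray-cluster" and "2.0.0" are illustrative placeholders.
helm upgrade ray-cluster kuberay/ray-cluster -n ray-system \
  --set image.tag=2.0.0
```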