KubeRay sample RayService not launching Serve apps

How severely does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

Steps to reproduce:

  1. Installed the KubeRay operator:
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
helm install kuberay-operator kuberay/kuberay-operator --version 1.1.0
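
As a quick sanity check that the operator came up (assuming the default Helm release name):

kubectl get deployment kuberay-operator
kubectl logs deployment/kuberay-operator | tail
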
  2. Slightly updated the rayservice-sample.yaml file (pasted below) and packaged the application code into a Docker image instead of using a runtime env.

Dockerfile:

# File name: Dockerfile
FROM rayproject/ray:2.11.0

WORKDIR /app

COPY ./ray-test-app ./
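
The image is built and pushed along these lines (the tag matches the image referenced in the YAML below):

docker build -t gcr.io/spark-dev-083/ray/sample-app:autoscaler .
docker push gcr.io/spark-dev-083/ray/sample-app:autoscaler
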
  3. Ran kubectl describe rayservice rayservice-sample:
Events:
  Type    Reason                       Age                   From                   Message
  ----    ------                       ----                  ----                   -------
  Normal  WaitForServeDeploymentReady  26m (x129 over 36m)   rayservice-controller  Fail to create / update Serve applications. If you observe this error consistently, please check "Issue 5: Fail to create / update Serve applications." in https://docs.ray.io/en/master/cluster/kubernetes/troubleshooting/rayservice-troubleshooting.html#kuberay-raysvc-troubleshoot for more details. err: Put "http://rayservice-sample-raycluster-7pwbq-head-svc.rubrik-spark.svc.cluster.local:8265/api/serve/applications/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Questions

Q1. The RayService Quickstart (Ray 2.11.0) says:

When the Ray Serve applications are healthy and ready, KubeRay creates a head service and a Ray Serve service for the RayService custom resource. For example, rayservice-sample-head-svc and rayservice-sample-serve-svc.

Can someone help me understand why the "Ray Serve applications are healthy and ready" condition never becomes true? Why are no actors getting launched?

Q2. The RayService troubleshooting docs (Ray 3.0.0.dev0) don't list the issue I'm seeing; could this issue be added there as well?

Q3. I don't see a /tmp/ray/session_latest/logs/serve/ directory in my head pod. Why is it missing?
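
(Checked with something like the following, where <head-pod> stands in for the actual pod name:)

kubectl exec -it <head-pod> -- ls /tmp/ray/session_latest/logs/serve/
# ls: cannot access '/tmp/ray/session_latest/logs/serve/': No such file or directory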

Other logs/artifacts

  1. Full rayservice-sample.yaml:
# Make sure to increase resource requests and limits before using this example in production.
# For examples with more realistic resource configuration, see
# ray-cluster.complete.large.yaml and
# ray-cluster.autoscaler.large.yaml.
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: rayservice-sample
spec:
  # serveConfigV2 takes a yaml multi-line scalar, which should be a Ray Serve multi-application config. See https://docs.ray.io/en/latest/serve/multi-app.html.
  serveConfigV2: |
    applications:
      - name: math_app
        import_path: conditional_dag.serve_dag
        route_prefix: /calc
        deployments:
          - name: Adder
            user_config:
              increment: 3
            max_ongoing_requests: 2
            autoscaling_config:
              target_ongoing_requests: 1
              metrics_interval_s: 0.2
              min_replicas: 0
              initial_replicas: 0
              max_replicas: 100
              look_back_period_s: 2
              downscale_delay_s: 10
              upscale_delay_s: 0
            graceful_shutdown_timeout_s: 5
            ray_actor_options:
              num_cpus: 0.5
          - name: Multiplier
            user_config:
              factor: 5
            max_ongoing_requests: 2
            autoscaling_config:
              target_ongoing_requests: 1
              metrics_interval_s: 0.2
              min_replicas: 0
              initial_replicas: 0
              max_replicas: 100
              look_back_period_s: 2
              downscale_delay_s: 10
              upscale_delay_s: 0
            graceful_shutdown_timeout_s: 5
            ray_actor_options:
              num_cpus: 1.5
          - name: Router
            max_ongoing_requests: 2
            autoscaling_config:
              target_ongoing_requests: 1
              metrics_interval_s: 0.2
              min_replicas: 1
              max_replicas: 100
              look_back_period_s: 2
              downscale_delay_s: 10
              upscale_delay_s: 0
            graceful_shutdown_timeout_s: 5
            ray_actor_options:
              num_cpus: 1
  rayClusterConfig:
    rayVersion: '2.11.0' # should match the Ray version in the image of the containers
#    enableInTreeAutoscaling: true
    ######################headGroupSpecs#################################
    # Ray head pod template.
    headGroupSpec:
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams:
        dashboard-host: '0.0.0.0'
      #pod template
      template:
        spec:
          containers:
            - name: ray-head
              image: gcr.io/spark-dev-083/ray/sample-app:autoscaler
              resources:
                limits:
                  cpu: 2
                  memory: 2Gi
                requests:
                  cpu: 2
                  memory: 2Gi
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
    workerGroupSpecs:
      # the pod replicas in this group typed worker
      - replicas: 1
        minReplicas: 1
        maxReplicas: 50
        # logical group name, for this called small-group, also can be functional
        groupName: small-group
        # The `rayStartParams` are used to configure the `ray start` command.
        # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
        # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
        rayStartParams: {}
        #pod template
        template:
          spec:
            containers:
              - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name',  or '123-abc'
                image: gcr.io/spark-dev-083/ray/sample-app:autoscaler
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh","-c","ray stop"]
                resources:
                  limits:
                    cpu: 3
                    memory: "2Gi"
                  requests:
                    cpu: 3
                    memory: "2Gi"
  2. Full output of kubectl describe rayservice rayservice-sample:
kubectl describe rayservice rayservice-sample
Name:         rayservice-sample
Namespace:    rubrik-spark
Labels:       <none>
Annotations:  <none>
API Version:  ray.io/v1
Kind:         RayService
Metadata:
  Creation Timestamp:  2024-04-19T21:14:09Z
  Generation:          3
  Managed Fields:
    API Version:  ray.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
      f:spec:
        .:
        f:rayClusterConfig:
          .:
          f:headGroupSpec:
            .:
            f:rayStartParams:
              .:
              f:dashboard-host:
            f:template:
              .:
              f:spec:
                .:
                f:containers:
          f:rayVersion:
          f:workerGroupSpecs:
        f:serveConfigV2:
    Manager:      kubectl-client-side-apply
    Operation:    Update
    Time:         2024-04-19T21:27:20Z
    API Version:  ray.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:activeServiceStatus:
          .:
          f:rayClusterStatus:
            .:
            f:desiredCPU:
            f:desiredGPU:
            f:desiredMemory:
            f:desiredTPU:
            f:desiredWorkerReplicas:
            f:endpoints:
              .:
              f:client:
              f:dashboard:
              f:gcs-server:
              f:metrics:
              f:serve:
            f:head:
              .:
              f:podIP:
              f:serviceIP:
            f:lastUpdateTime:
            f:maxWorkerReplicas:
            f:minWorkerReplicas:
            f:observedGeneration:
        f:observedGeneration:
        f:pendingServiceStatus:
          .:
          f:rayClusterName:
          f:rayClusterStatus:
            .:
            f:desiredCPU:
            f:desiredGPU:
            f:desiredMemory:
            f:desiredTPU:
            f:head:
        f:serviceStatus:
    Manager:         kuberay-operator
    Operation:       Update
    Subresource:     status
    Time:            2024-04-19T21:27:25Z
  Resource Version:  556781297
  UID:               07255e8a-788b-49d8-8a66-04ce7f996caa
Spec:
  Ray Cluster Config:
    Head Group Spec:
      Ray Start Params:
        Dashboard - Host:  0.0.0.0
      Template:
        Spec:
          Containers:
            Image:  gcr.io/spark-dev-083/ray/sample-app:autoscaler
            Name:   ray-head
            Ports:
              Container Port:  6379
              Name:            gcs-server
              Protocol:        TCP
              Container Port:  8265
              Name:            dashboard
              Protocol:        TCP
              Container Port:  10001
              Name:            client
              Protocol:        TCP
              Container Port:  8000
              Name:            serve
              Protocol:        TCP
            Resources:
              Limits:
                Cpu:     2
                Memory:  2Gi
              Requests:
                Cpu:     2
                Memory:  2Gi
    Ray Version:         2.11.0
    Worker Group Specs:
      Group Name:    small-group
      Max Replicas:  50
      Min Replicas:  1
      Num Of Hosts:  1
      Ray Start Params:
      Replicas:  1
      Template:
        Spec:
          Containers:
            Image:  gcr.io/spark-dev-083/ray/sample-app:autoscaler
            Lifecycle:
              Pre Stop:
                Exec:
                  Command:
                    /bin/sh
                    -c
                    ray stop
            Name:  ray-worker
            Resources:
              Limits:
                Cpu:     3
                Memory:  2Gi
              Requests:
                Cpu:     3
                Memory:  2Gi
  serveConfigV2:         applications:
  - name: math_app
    import_path: conditional_dag.serve_dag
    route_prefix: /calc
    deployments:
      - name: Adder
        user_config:
          increment: 3
        max_ongoing_requests: 2
        autoscaling_config:
          target_ongoing_requests: 1
          metrics_interval_s: 0.2
          min_replicas: 0
          initial_replicas: 0
          max_replicas: 100
          look_back_period_s: 2
          downscale_delay_s: 10
          upscale_delay_s: 0
        graceful_shutdown_timeout_s: 5
        ray_actor_options:
          num_cpus: 0.5
      - name: Multiplier
        user_config:
          factor: 5
        max_ongoing_requests: 2
        autoscaling_config:
          target_ongoing_requests: 1
          metrics_interval_s: 0.2
          min_replicas: 0
          initial_replicas: 0
          max_replicas: 100
          look_back_period_s: 2
          downscale_delay_s: 10
          upscale_delay_s: 0
        graceful_shutdown_timeout_s: 5
        ray_actor_options:
          num_cpus: 1.5
      - name: Router
        max_ongoing_requests: 2
        autoscaling_config:
          target_ongoing_requests: 1
          metrics_interval_s: 0.2
          min_replicas: 1
          max_replicas: 100
          look_back_period_s: 2
          downscale_delay_s: 10
          upscale_delay_s: 0
        graceful_shutdown_timeout_s: 5
        ray_actor_options:
              num_cpus: 1

Status:
  Active Service Status:
    Ray Cluster Status:
      Desired CPU:              5
      Desired GPU:              0
      Desired Memory:           4Gi
      Desired TPU:              0
      Desired Worker Replicas:  1
      Endpoints:
        Client:        10001
        Dashboard:     8265
        Gcs - Server:  6379
        Metrics:       8080
        Serve:         8000
      Head:
        Pod IP:             10.88.22.45
        Service IP:         10.92.12.75
      Last Update Time:     2024-04-19T21:27:21Z
      Max Worker Replicas:  50
      Min Worker Replicas:  1
      Observed Generation:  1
  Observed Generation:      3
  Pending Service Status:
    Ray Cluster Name:  rayservice-sample-raycluster-7pwbq
    Ray Cluster Status:
      Desired CPU:     0
      Desired GPU:     0
      Desired Memory:  0
      Desired TPU:     0
      Head:
  Service Status:  WaitForServeDeploymentReady
Events:
  Type    Reason                       Age                   From                   Message
  ----    ------                       ----                  ----                   -------
  Normal  WaitForServeDeploymentReady  26m (x129 over 36m)   rayservice-controller  Fail to create / update Serve applications. If you observe this error consistently, please check "Issue 5: Fail to create / update Serve applications." in https://docs.ray.io/en/master/cluster/kubernetes/troubleshooting/rayservice-troubleshooting.html#kuberay-raysvc-troubleshoot for more details. err: Put "http://rayservice-sample-raycluster-7pwbq-head-svc.rubrik-spark.svc.cluster.local:8265/api/serve/applications/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  3. KubeRay operator logs: kubectl logs $KUBERAY_OPERATOR_POD -n $YOUR_NAMESPACE | tee operator-log
{"level":"info","ts":"2024-04-20T01:11:24.753Z","logger":"setup","msg":"Flag watchNamespace is not set. Watch custom resources in all namespaces."}
{"level":"info","ts":"2024-04-20T01:11:24.753Z","logger":"setup","msg":"Setup manager"}
{"level":"info","ts":"2024-04-20T01:11:24.861Z","logger":"setup","msg":"starting manager"}
{"level":"info","ts":"2024-04-20T01:11:24.861Z","logger":"controller-runtime.metrics","msg":"Starting metrics server"}
{"level":"info","ts":"2024-04-20T01:11:24.861Z","logger":"controller-runtime.metrics","msg":"Serving metrics server","bindAddress":":8080","secure":false}
{"level":"info","ts":"2024-04-20T01:11:24.861Z","msg":"starting server","kind":"health probe","addr":"[::]:8082"}
I0420 01:11:27.650475       1 leaderelection.go:250] attempting to acquire leader lease rubrik-spark/ray-operator-leader...
I0420 01:11:43.072466       1 leaderelection.go:260] successfully acquired lease rubrik-spark/ray-operator-leader
{"level":"info","ts":"2024-04-20T01:11:43.073Z","logger":"controllers.RayCluster","msg":"Starting EventSource","source":"kind source: *v1.RayCluster"}
{"level":"info","ts":"2024-04-20T01:11:43.073Z","logger":"controllers.RayCluster","msg":"Starting EventSource","source":"kind source: *v1.Pod"}
{"level":"info","ts":"2024-04-20T01:11:43.073Z","logger":"controllers.RayCluster","msg":"Starting EventSource","source":"kind source: *v1.Service"}
{"level":"info","ts":"2024-04-20T01:11:43.073Z","logger":"controllers.RayCluster","msg":"Starting Controller"}
{"level":"info","ts":"2024-04-20T01:11:43.073Z","logger":"controllers.RayJob","msg":"Starting EventSource","source":"kind source: *v1.RayJob"}
{"level":"info","ts":"2024-04-20T01:11:43.074Z","logger":"controllers.RayJob","msg":"Starting EventSource","source":"kind source: *v1.RayCluster"}
{"level":"info","ts":"2024-04-20T01:11:43.074Z","logger":"controllers.RayJob","msg":"Starting EventSource","source":"kind source: *v1.Service"}
{"level":"info","ts":"2024-04-20T01:11:43.074Z","logger":"controllers.RayJob","msg":"Starting EventSource","source":"kind source: *v1.Job"}
{"level":"info","ts":"2024-04-20T01:11:43.074Z","logger":"controllers.RayJob","msg":"Starting Controller"}
{"level":"info","ts":"2024-04-20T01:11:43.074Z","logger":"controllers.RayService","msg":"Starting EventSource","source":"kind source: *v1.RayService"}
{"level":"info","ts":"2024-04-20T01:11:43.074Z","logger":"controllers.RayService","msg":"Starting EventSource","source":"kind source: *v1.RayCluster"}
{"level":"info","ts":"2024-04-20T01:11:43.074Z","logger":"controllers.RayService","msg":"Starting EventSource","source":"kind source: *v1.Service"}
{"level":"info","ts":"2024-04-20T01:11:43.074Z","logger":"controllers.RayService","msg":"Starting EventSource","source":"kind source: *v1.Ingress"}
{"level":"info","ts":"2024-04-20T01:11:43.074Z","logger":"controllers.RayService","msg":"Starting Controller"}
{"level":"info","ts":"2024-04-20T01:11:43.975Z","logger":"controllers.RayCluster","msg":"Starting workers","worker count":1}
{"level":"info","ts":"2024-04-20T01:11:43.980Z","logger":"controllers.RayService","msg":"Starting workers","worker count":1}
{"level":"info","ts":"2024-04-20T01:11:44.051Z","logger":"controllers.RayJob","msg":"Starting workers","worker count":1}
{"level":"info","ts":"2024-04-20T01:14:13.968Z","logger":"controllers.RayService","msg":"No active Ray cluster. RayService operator should prepare a new Ray cluster.","RayService":{"name":"rayservice-sample","namespace":"rubrik-spark"},"reconcileID":"a7afdd7c-47d2-4012-be34-18a62df2936f"}
{"level":"info","ts":"2024-04-20T01:14:13.968Z","logger":"controllers.RayService","msg":"Current cluster is unhealthy, prepare to restart.","RayService":{"name":"rayservice-sample","namespace":"rubrik-spark"},"reconcileID":"a7afdd7c-47d2-4012-be34-18a62df2936f","Status":{"activeServiceStatus":{"rayClusterStatus":{"desiredCPU":"0","desiredMemory":"0","desiredGPU":"0","desiredTPU":"0","head":{}}},"pendingServiceStatus":{"rayClusterStatus":{"desiredCPU":"0","desiredMemory":"0","desiredGPU":"0","desiredTPU":"0","head":{}}},"observedGeneration":1}}
{"level":"info","ts":"2024-04-20T01:14:13.986Z","logger":"KubeAPIWarningLogger","msg":"unknown field \"spec.rayClusterConfig.headGroupSpec.template.metadata.creationTimestamp\""}
{"level":"info","ts":"2024-04-20T01:14:13.986Z","logger":"KubeAPIWarningLogger","msg":"unknown field \"spec.rayClusterConfig.workerGroupSpecs[0].template.metadata.creationTimestamp\""}
{"level":"info","ts":"2024-04-20T01:14:14.050Z","logger":"controllers.RayService","msg":"Done reconcileRayCluster update status, enter next loop to create new ray cluster.","RayService":{"name":"rayservice-sample","namespace":"rubrik-spark"},"reconcileID":"a7afdd7c-47d2-4012-be34-18a62df2936f"}
{"level":"info","ts":"2024-04-20T01:14:16.051Z","logger":"controllers.RayService","msg":"Creating a new pending RayCluster instance.","RayService":{"name":"rayservice-sample","namespace":"rubrik-spark"},"reconcileID":"afbab1ea-067b-4d31-8b07-553026007b3a"}
{"level":"info","ts":"2024-04-20T01:14:16.051Z","logger":"controllers.RayService","msg":"createRayClusterInstance","RayService":{"name":"rayservice-sample","namespace":"rubrik-spark"},"reconcileID":"afbab1ea-067b-4d31-8b07-553026007b3a","rayClusterInstanceName":"rayservice-sample-raycluster-9s5d7"}
{"level":"info","ts":"2024-04-20T01:14:16.051Z","logger":"controllers.RayService","msg":"No pending RayCluster, creating RayCluster.","RayService":{"name":"rayservice-sample","namespace":"rubrik-spark"},"reconcileID":"afbab1ea-067b-4d31-8b07-553026007b3a"}
{"level":"info","ts":"2024-04-20T01:14:16.063Z","logger":"KubeAPIWarningLogger","msg":"unknown field \"spec.headGroupSpec.template.metadata.creationTimestamp\""}
{"level":"info","ts":"2024-04-20T01:14:16.063Z","logger":"KubeAPIWarningLogger","msg":"unknown field \"spec.workerGroupSpecs[0].template.metadata.creationTimestamp\""}
{"level":"info","ts":"2024-04-20T01:14:16.064Z","logger":"controllers.RayService","msg":"created rayCluster for rayService","RayService":{"name":"rayservice-sample","namespace":"rubrik-spark"},"reconcileID":"afbab1ea-067b-4d31-8b07-553026007b3a","rayCluster":{"namespace":"rubrik-spark","name":"rayservice-sample-raycluster-9s5d7"}}
{"level":"info","ts":"2024-04-20T01:14:16.064Z","logger":"controllers.RayService","msg":"Check the head Pod status of the pending RayCluster","RayService":{"name":"rayservice-sample","namespace":"rubrik-spark"},"reconcileID":"afbab1ea-067b-4d31-8b07-553026007b3a","RayCluster name":"rayservice-sample-raycluster-9s5d7"}
{"level":"info","ts":"2024-04-20T01:14:16.064Z","logger":"controllers.RayCluster","msg":"Reconciling Ingress","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"07f82f27-a29c-41fc-a2eb-a5ce955d4265"}
{"level":"error","ts":"2024-04-20T01:14:16.064Z","logger":"controllers.RayService","msg":"Failed to check if head Pod is running and ready!","RayService":{"name":"rayservice-sample","namespace":"rubrik-spark"},"reconcileID":"afbab1ea-067b-4d31-8b07-553026007b3a","error":"Found 0 head pods for RayCluster rayservice-sample-raycluster-9s5d7 in the namespace rubrik-spark","stacktrace":"github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayServiceReconciler).reconcileServe\n\t/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/rayservice_controller.go:1069\ngithub.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayServiceReconciler).Reconcile\n\t/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/rayservice_controller.go:168\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}
{"level":"error","ts":"2024-04-20T01:14:16.064Z","logger":"controllers.RayService","msg":"Fail to reconcileServe.","RayService":{"name":"rayservice-sample","namespace":"rubrik-spark"},"reconcileID":"afbab1ea-067b-4d31-8b07-553026007b3a","error":"Found 0 head pods for RayCluster rayservice-sample-raycluster-9s5d7 in the namespace rubrik-spark","stacktrace":"github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayServiceReconciler).Reconcile\n\t/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/rayservice_controller.go:169\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}
{"level":"info","ts":"2024-04-20T01:14:16.065Z","logger":"controllers.RayService","msg":"Check the head Pod status of the pending RayCluster","RayService":{"name":"rayservice-sample","namespace":"rubrik-spark"},"reconcileID":"3354d6f3-9a15-4a87-b56b-9eb7c809f7f1","RayCluster name":"rayservice-sample-raycluster-9s5d7"}
{"level":"error","ts":"2024-04-20T01:14:16.066Z","logger":"controllers.RayService","msg":"Failed to check if head Pod is running and ready!","RayService":{"name":"rayservice-sample","namespace":"rubrik-spark"},"reconcileID":"3354d6f3-9a15-4a87-b56b-9eb7c809f7f1","error":"Found 0 head pods for RayCluster rayservice-sample-raycluster-9s5d7 in the namespace rubrik-spark","stacktrace":"github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayServiceReconciler).reconcileServe\n\t/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/rayservice_controller.go:1069\ngithub.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayServiceReconciler).Reconcile\n\t/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/rayservice_controller.go:168\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}
{"level":"error","ts":"2024-04-20T01:14:16.066Z","logger":"controllers.RayService","msg":"Fail to reconcileServe.","RayService":{"name":"rayservice-sample","namespace":"rubrik-spark"},"reconcileID":"3354d6f3-9a15-4a87-b56b-9eb7c809f7f1","error":"Found 0 head pods for RayCluster rayservice-sample-raycluster-9s5d7 in the namespace rubrik-spark","stacktrace":"github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayServiceReconciler).Reconcile\n\t/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/rayservice_controller.go:169\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}
{"level":"info","ts":"2024-04-20T01:14:16.106Z","logger":"controllers.RayCluster","msg":"Pod Service created successfully","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"07f82f27-a29c-41fc-a2eb-a5ce955d4265","service name":"rayservice-sample-raycluster-9s5d7-head-svc"}
{"level":"info","ts":"2024-04-20T01:14:16.107Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"07f82f27-a29c-41fc-a2eb-a5ce955d4265","Found 0 head Pods; creating a head Pod for the RayCluster.":"rayservice-sample-raycluster-9s5d7"}
{"level":"info","ts":"2024-04-20T01:14:16.107Z","logger":"controllers.RayCluster","msg":"head pod labels","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"07f82f27-a29c-41fc-a2eb-a5ce955d4265","labels":{"app.kubernetes.io/created-by":"kuberay-operator","app.kubernetes.io/name":"kuberay","ray.io/cluster":"rayservice-sample-raycluster-9s5d7","ray.io/group":"headgroup","ray.io/identifier":"rayservice-sample-raycluster-9s5d7-head","ray.io/is-ray-node":"yes","ray.io/node-type":"head"}}
{"level":"info","ts":"2024-04-20T01:14:16.107Z","logger":"controllers.RayCluster","msg":"generateRayStartCommand","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"07f82f27-a29c-41fc-a2eb-a5ce955d4265","nodeType":"head","rayStartParams":{"block":"true","dashboard-agent-listen-port":"52365","dashboard-host":"0.0.0.0","metrics-export-port":"8080"},"Ray container resource":{"limits":{"cpu":"2","memory":"2Gi"},"requests":{"cpu":"2","memory":"2Gi"}}}
{"level":"info","ts":"2024-04-20T01:14:16.107Z","logger":"controllers.RayCluster","msg":"generateRayStartCommand","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"07f82f27-a29c-41fc-a2eb-a5ce955d4265","rayStartCmd":"ray start --head  --dashboard-agent-listen-port=52365  --num-cpus=2  --memory=2147483648  --dashboard-host=0.0.0.0  --metrics-export-port=8080  --block "}
{"level":"info","ts":"2024-04-20T01:14:16.107Z","logger":"controllers.RayCluster","msg":"BuildPod","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"07f82f27-a29c-41fc-a2eb-a5ce955d4265","rayNodeType":"head","generatedCmd":"ulimit -n 65536; ray start --head  --dashboard-agent-listen-port=52365  --num-cpus=2  --memory=2147483648  --dashboard-host=0.0.0.0  --metrics-export-port=8080  --block "}
{"level":"info","ts":"2024-04-20T01:14:16.107Z","logger":"controllers.RayCluster","msg":"Probes injection feature flag","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"07f82f27-a29c-41fc-a2eb-a5ce955d4265","enabled":true}
{"level":"info","ts":"2024-04-20T01:14:16.107Z","logger":"controllers.RayCluster","msg":"createHeadPod","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"07f82f27-a29c-41fc-a2eb-a5ce955d4265","head pod with name":"rayservice-sample-raycluster-9s5d7-head-"}
{"level":"info","ts":"2024-04-20T01:14:16.150Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"07f82f27-a29c-41fc-a2eb-a5ce955d4265","desired workerReplicas (always adhering to minReplicas/maxReplica)":1,"worker group":"small-group","maxReplicas":50,"minReplicas":1,"replicas":1}
{"level":"info","ts":"2024-04-20T01:14:16.151Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"07f82f27-a29c-41fc-a2eb-a5ce955d4265","removing the pods in the scaleStrategy of":"small-group"}
{"level":"info","ts":"2024-04-20T01:14:16.151Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"07f82f27-a29c-41fc-a2eb-a5ce955d4265","workerReplicas":1,"runningPods":0,"diff":1}
{"level":"info","ts":"2024-04-20T01:14:16.151Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"07f82f27-a29c-41fc-a2eb-a5ce955d4265","Number workers to add":1,"Worker group":"small-group"}
{"level":"info","ts":"2024-04-20T01:14:16.151Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"07f82f27-a29c-41fc-a2eb-a5ce955d4265","creating worker for group":"small-group","index 0":"in total 1"}
{"level":"info","ts":"2024-04-20T01:14:16.151Z","logger":"controllers.RayCluster","msg":"generateRayStartCommand","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"07f82f27-a29c-41fc-a2eb-a5ce955d4265","nodeType":"worker","rayStartParams":{"address":"rayservice-sample-raycluster-9s5d7-head-svc.rubrik-spark.svc.cluster.local:6379","block":"true","dashboard-agent-listen-port":"52365","metrics-export-port":"8080"},"Ray container resource":{"limits":{"cpu":"3","memory":"2Gi"},"requests":{"cpu":"3","memory":"2Gi"}}}
{"level":"info","ts":"2024-04-20T01:14:16.151Z","logger":"controllers.RayCluster","msg":"generateRayStartCommand","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"07f82f27-a29c-41fc-a2eb-a5ce955d4265","rayStartCmd":"ray start  --address=rayservice-sample-raycluster-9s5d7-head-svc.rubrik-spark.svc.cluster.local:6379  --metrics-export-port=8080  --block  --dashboard-agent-listen-port=52365  --num-cpus=3  --memory=2147483648 "}
{"level":"info","ts":"2024-04-20T01:14:16.151Z","logger":"controllers.RayCluster","msg":"BuildPod","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"07f82f27-a29c-41fc-a2eb-a5ce955d4265","rayNodeType":"worker","generatedCmd":"ulimit -n 65536; ray start  --address=rayservice-sample-raycluster-9s5d7-head-svc.rubrik-spark.svc.cluster.local:6379  --metrics-export-port=8080  --block  --dashboard-agent-listen-port=52365  --num-cpus=3  --memory=2147483648 "}
{"level":"info","ts":"2024-04-20T01:14:16.151Z","logger":"controllers.RayCluster","msg":"Probes injection feature flag","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"07f82f27-a29c-41fc-a2eb-a5ce955d4265","enabled":true}
{"level":"info","ts":"2024-04-20T01:14:16.174Z","logger":"controllers.RayCluster","msg":"Created pod","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"07f82f27-a29c-41fc-a2eb-a5ce955d4265","Pod ":"ervice-sample-raycluster-9s5d7-worker-small-group-"}
{"level":"info","ts":"2024-04-20T01:14:16.175Z","logger":"controllers.RayCluster","msg":"CheckAllPodsRunning: Pod is not running; Pod Name: rayservice-sample-raycluster-9s5d7-head-z5st7; Pod Status.Phase: Pending","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"07f82f27-a29c-41fc-a2eb-a5ce955d4265"}
{"level":"info","ts":"2024-04-20T01:14:16.176Z","logger":"controllers.RayCluster","msg":"inconsistentRayClusterStatus","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"07f82f27-a29c-41fc-a2eb-a5ce955d4265","detect inconsistency":"old AvailableWorkerReplicas: 0, new AvailableWorkerReplicas: 0, old DesiredWorkerReplicas: 0, new DesiredWorkerReplicas: 1, old MinWorkerReplicas: 0, new MinWorkerReplicas: 1, old MaxWorkerReplicas: 0, new MaxWorkerReplicas: 50"}
{"level":"info","ts":"2024-04-20T01:14:16.176Z","logger":"controllers.RayCluster","msg":"rayClusterReconcile","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"07f82f27-a29c-41fc-a2eb-a5ce955d4265","Update CR status":"rayservice-sample-raycluster-9s5d7","status":{"desiredWorkerReplicas":1,"minWorkerReplicas":1,"maxWorkerReplicas":50,"desiredCPU":"5","desiredMemory":"4Gi","desiredGPU":"0","desiredTPU":"0","lastUpdateTime":"2024-04-20T01:14:16Z","endpoints":{"client":"10001","dashboard":"8265","gcs-server":"6379","metrics":"8080","serve":"8000"},"head":{"serviceIP":"10.92.15.52"},"observedGeneration":1}}
{"level":"info","ts":"2024-04-20T01:14:16.251Z","logger":"controllers.RayCluster","msg":"Environment variable RAYCLUSTER_DEFAULT_REQUEUE_SECONDS_ENV is not set, using default value of 300 seconds","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"07f82f27-a29c-41fc-a2eb-a5ce955d4265","cluster name":"rayservice-sample-raycluster-9s5d7"}
{"level":"info","ts":"2024-04-20T01:14:16.251Z","logger":"controllers.RayCluster","msg":"Unconditional requeue after","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"07f82f27-a29c-41fc-a2eb-a5ce955d4265","cluster name":"rayservice-sample-raycluster-9s5d7","seconds":300}
{"level":"info","ts":"2024-04-20T01:14:16.251Z","logger":"controllers.RayCluster","msg":"Reconciling Ingress","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"44413d62-4bfb-4ea0-81cb-492cc75b00fc"}
{"level":"info","ts":"2024-04-20T01:14:16.251Z","logger":"controllers.RayCluster","msg":"reconcileHeadService","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"44413d62-4bfb-4ea0-81cb-492cc75b00fc","1 head service found":"rayservice-sample-raycluster-9s5d7-head-svc"}
{"level":"info","ts":"2024-04-20T01:14:16.252Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"44413d62-4bfb-4ea0-81cb-492cc75b00fc","Found 1 head Pod":"rayservice-sample-raycluster-9s5d7-head-z5st7","Pod status":"Pending","Pod restart policy":"Always","Ray container terminated status":"nil"}
{"level":"info","ts":"2024-04-20T01:14:16.252Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"44413d62-4bfb-4ea0-81cb-492cc75b00fc","head Pod":"rayservice-sample-raycluster-9s5d7-head-z5st7","shouldDelete":false,"reason":"KubeRay does not need to delete the head Pod rayservice-sample-raycluster-9s5d7-head-z5st7. The Pod status is Pending, and the Ray container terminated status is nil."}
{"level":"info","ts":"2024-04-20T01:14:16.252Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"44413d62-4bfb-4ea0-81cb-492cc75b00fc","desired workerReplicas (always adhering to minReplicas/maxReplica)":1,"worker group":"small-group","maxReplicas":50,"minReplicas":1,"replicas":1}
{"level":"info","ts":"2024-04-20T01:14:16.252Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"44413d62-4bfb-4ea0-81cb-492cc75b00fc","worker Pod":"ervice-sample-raycluster-9s5d7-worker-small-group-5zqcc","shouldDelete":false,"reason":"KubeRay does not need to delete the worker Pod ervice-sample-raycluster-9s5d7-worker-small-group-5zqcc. The Pod status is Pending, and the Ray container terminated status is nil."}
{"level":"info","ts":"2024-04-20T01:14:16.252Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"44413d62-4bfb-4ea0-81cb-492cc75b00fc","removing the pods in the scaleStrategy of":"small-group"}
{"level":"info","ts":"2024-04-20T01:14:16.252Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"44413d62-4bfb-4ea0-81cb-492cc75b00fc","workerReplicas":1,"runningPods":1,"diff":0}
{"level":"info","ts":"2024-04-20T01:14:16.252Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"44413d62-4bfb-4ea0-81cb-492cc75b00fc","all workers already exist for group":"small-group"}
{"level":"info","ts":"2024-04-20T01:14:16.252Z","logger":"controllers.RayService","msg":"Check the head Pod status of the pending RayCluster","RayService":{"name":"rayservice-sample","namespace":"rubrik-spark"},"reconcileID":"5935e12d-dfe7-4c36-9bf7-56bc5a3858a7","RayCluster name":"rayservice-sample-raycluster-9s5d7"}
{"level":"info","ts":"2024-04-20T01:14:16.253Z","logger":"controllers.RayCluster","msg":"CheckAllPodsRunning: Pod is not running; Pod Name: rayservice-sample-raycluster-9s5d7-head-z5st7; Pod Status.Phase: Pending","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"44413d62-4bfb-4ea0-81cb-492cc75b00fc"}
{"level":"info","ts":"2024-04-20T01:14:16.253Z","logger":"controllers.RayService","msg":"Skipping the update of Serve deployments because the Ray head Pod is not ready.","RayService":{"name":"rayservice-sample","namespace":"rubrik-spark"},"reconcileID":"5935e12d-dfe7-4c36-9bf7-56bc5a3858a7"}
{"level":"info","ts":"2024-04-20T01:14:16.253Z","logger":"controllers.RayService","msg":"Ray Serve applications are not ready to serve requests: checking again in 2ss","RayService":{"name":"rayservice-sample","namespace":"rubrik-spark"},"reconcileID":"5935e12d-dfe7-4c36-9bf7-56bc5a3858a7"}
{"level":"info","ts":"2024-04-20T01:14:16.254Z","logger":"controllers.RayCluster","msg":"inconsistentRayClusterStatus","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"44413d62-4bfb-4ea0-81cb-492cc75b00fc","detect inconsistency":"old AvailableWorkerReplicas: 0, new AvailableWorkerReplicas: 0, old DesiredWorkerReplicas: 0, new DesiredWorkerReplicas: 1, old MinWorkerReplicas: 0, new MinWorkerReplicas: 1, old MaxWorkerReplicas: 0, new MaxWorkerReplicas: 50"}
{"level":"info","ts":"2024-04-20T01:14:16.254Z","logger":"controllers.RayCluster","msg":"rayClusterReconcile","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"44413d62-4bfb-4ea0-81cb-492cc75b00fc","Update CR status":"rayservice-sample-raycluster-9s5d7","status":{"desiredWorkerReplicas":1,"minWorkerReplicas":1,"maxWorkerReplicas":50,"desiredCPU":"5","desiredMemory":"4Gi","desiredGPU":"0","desiredTPU":"0","lastUpdateTime":"2024-04-20T01:14:16Z","endpoints":{"client":"10001","dashboard":"8265","gcs-server":"6379","metrics":"8080","serve":"8000"},"head":{"serviceIP":"10.92.15.52"},"observedGeneration":1}}
{"level":"info","ts":"2024-04-20T01:14:16.262Z","logger":"controllers.RayCluster","msg":"Got error when updating status","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"44413d62-4bfb-4ea0-81cb-492cc75b00fc","cluster name":"rayservice-sample-raycluster-9s5d7","error":"Operation cannot be fulfilled on rayclusters.ray.io \"rayservice-sample-raycluster-9s5d7\": the object has been modified; please apply your changes to the latest version and try again","RayCluster":{"apiVersion":"ray.io/v1","kind":"RayCluster","namespace":"rubrik-spark","name":"rayservice-sample-raycluster-9s5d7"}}

…continued: KubeRay operator logs:

{"level":"info","ts":"2024-04-20T01:14:16.262Z","logger":"controllers.RayCluster","msg":"Warning: Reconciler returned both a non-zero result and a non-nil error. The result will always be ignored if the error is non-nil and the non-nil error causes reqeueuing with exponential backoff. For more details, see: https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/reconcile#Reconciler","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"44413d62-4bfb-4ea0-81cb-492cc75b00fc"}
{"level":"error","ts":"2024-04-20T01:14:16.262Z","logger":"controllers.RayCluster","msg":"Reconciler error","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"44413d62-4bfb-4ea0-81cb-492cc75b00fc","error":"Operation cannot be fulfilled on rayclusters.ray.io \"rayservice-sample-raycluster-9s5d7\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}
{"level":"info","ts":"2024-04-20T01:14:16.350Z","logger":"controllers.RayCluster","msg":"Reconciling Ingress","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"4f046e1e-b68f-4bf8-b682-2811b52084a0"}
{"level":"info","ts":"2024-04-20T01:14:16.350Z","logger":"controllers.RayCluster","msg":"reconcileHeadService","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"4f046e1e-b68f-4bf8-b682-2811b52084a0","1 head service found":"rayservice-sample-raycluster-9s5d7-head-svc"}
{"level":"info","ts":"2024-04-20T01:14:16.351Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"4f046e1e-b68f-4bf8-b682-2811b52084a0","Found 1 head Pod":"rayservice-sample-raycluster-9s5d7-head-z5st7","Pod status":"Pending","Pod restart policy":"Always","Ray container terminated status":"nil"}
{"level":"info","ts":"2024-04-20T01:14:16.351Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"4f046e1e-b68f-4bf8-b682-2811b52084a0","head Pod":"rayservice-sample-raycluster-9s5d7-head-z5st7","shouldDelete":false,"reason":"KubeRay does not need to delete the head Pod rayservice-sample-raycluster-9s5d7-head-z5st7. The Pod status is Pending, and the Ray container terminated status is nil."}
{"level":"info","ts":"2024-04-20T01:14:16.351Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"4f046e1e-b68f-4bf8-b682-2811b52084a0","desired workerReplicas (always adhering to minReplicas/maxReplica)":1,"worker group":"small-group","maxReplicas":50,"minReplicas":1,"replicas":1}
{"level":"info","ts":"2024-04-20T01:14:16.352Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"4f046e1e-b68f-4bf8-b682-2811b52084a0","worker Pod":"ervice-sample-raycluster-9s5d7-worker-small-group-5zqcc","shouldDelete":false,"reason":"KubeRay does not need to delete the worker Pod ervice-sample-raycluster-9s5d7-worker-small-group-5zqcc. The Pod status is Pending, and the Ray container terminated status is nil."}
{"level":"info","ts":"2024-04-20T01:14:16.352Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"4f046e1e-b68f-4bf8-b682-2811b52084a0","removing the pods in the scaleStrategy of":"small-group"}
{"level":"info","ts":"2024-04-20T01:14:16.352Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"4f046e1e-b68f-4bf8-b682-2811b52084a0","workerReplicas":1,"runningPods":1,"diff":0}
{"level":"info","ts":"2024-04-20T01:14:16.352Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"4f046e1e-b68f-4bf8-b682-2811b52084a0","all workers already exist for group":"small-group"}
{"level":"info","ts":"2024-04-20T01:14:16.353Z","logger":"controllers.RayCluster","msg":"CheckAllPodsRunning: Pod is not running; Pod Name: rayservice-sample-raycluster-9s5d7-head-z5st7; Pod Status.Phase: Pending","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"4f046e1e-b68f-4bf8-b682-2811b52084a0"}
{"level":"info","ts":"2024-04-20T01:14:16.354Z","logger":"controllers.RayCluster","msg":"Environment variable RAYCLUSTER_DEFAULT_REQUEUE_SECONDS_ENV is not set, using default value of 300 seconds","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"4f046e1e-b68f-4bf8-b682-2811b52084a0","cluster name":"rayservice-sample-raycluster-9s5d7"}
{"level":"info","ts":"2024-04-20T01:14:16.354Z","logger":"controllers.RayCluster","msg":"Unconditional requeue after","RayCluster":{"name":"rayservice-sample-raycluster-9s5d7","namespace":"rubrik-spark"},"reconcileID":"4f046e1e-b68f-4bf8-b682-2811b52084a0","cluster name":"rayservice-sample-raycluster-9s5d7","seconds":300}
{"level":"info","ts":"2024-04-20T01:14:18.066Z","logger":"controllers.RayService","msg":"Check the head Pod status of the pending RayCluster","RayService":{"name":"rayservice-sample","namespace":"rubrik-spark"},"reconcileID":"5d86c0a5-fcba-483f-8c12-bcff7a21bbac","RayCluster name":"rayservice-sample-raycluster-9s5d7"}
{"level":"info","ts":"2024-04-20T01:14:18.066Z","logger":"controllers.RayService","msg":"Skipping the update of Serve deployments because the Ray head Pod is not ready.","RayService":{"name":"rayservice-sample","namespace":"rubrik-spark"},"reconcileID":"5d86c0a5-fcba-483f-8c12-bcff7a21bbac"}
{"level":"info","ts":"2024-04-20T01:14:18.066Z","logger":"controllers.RayService","msg":"Ray Serve applications are not ready to serve requests: checking again in 2ss","RayService":{"name":"rayservice-sample","namespace":"rubrik-spark"},"reconcileID":"5d86c0a5-fcba-483f-8c12-bcff7a21bbac"}
{"level":"info","ts":"2024-04-20T01:14:20.068Z","logger":"controllers.RayService","msg":"Check the head Pod status of the pending RayCluster","RayService":{"name":"rayservice-sample","namespace":"rubrik-spark"},"reconcileID":"dcea53ea-5ac7-4f09-af58-1a50fb26906e","RayCluster name":"rayservice-sample-raycluster-9s5d7"}
{"level":"info","ts":"2024-04-20T01:14:20.069Z","logger":"controllers.RayService","msg":"Skipping the update of Serve deployments because the Ray head Pod is not ready.","RayService":{"name":"rayservice-sample","namespace":"rubrik-spark"},"reconcileID":"dcea53ea-5ac7-4f09-af58-1a50fb26906e"}
{"level":"info","ts":"2024-04-20T01:14:20.069Z","logger":"controllers.RayService","msg":"Ray Serve applications are not ready to serve requests: checking again in 2ss","RayService":{"name":"rayservice-sample","namespace":"rubrik-spark"},"reconcileID":"dcea53ea-5ac7-4f09-af58-1a50fb26906e"}


Update: I deleted the sample app with kubectl delete -f ../sample-app/ray-service.sample.yaml and re-deployed it with kubectl apply -f ../sample-app/ray-service.sample.yaml, after making the following changes to the YAML file:

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: rayservice-sample
...
  rayClusterConfig:
    rayVersion: '2.11.0' # should match the Ray version in the image of the containers
#    enableInTreeAutoscaling: true
    ######################headGroupSpecs#################################
    # Ray head pod template.
    headGroupSpec:
      ...
              readinessProbe:
                exec:
                  command:
                    - sh
                    - -c
                    - wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && wget -T 2 -q -O- http://localhost:8265/api/gcs_healthz | grep success
                periodSeconds: 5
                timeoutSeconds: 60
                successThreshold: 1
                failureThreshold: 5
              livenessProbe:
                exec:
                  command:
                    - sh
                    - -c
                    - wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && wget -T 2 -q -O- http://localhost:8265/api/gcs_healthz | grep success
                periodSeconds: 5
                timeoutSeconds: 60
                successThreshold: 1
                failureThreshold: 5
    workerGroupSpecs:
      ...
                livenessProbe:
                  exec:
                    command:
                      - sh
                      - -c
                      - wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success
                  periodSeconds: 5
                  timeoutSeconds: 60
                  successThreshold: 1
                  failureThreshold: 5
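
The same probe commands can also be run by hand from inside the head pod to see which endpoint fails; a sketch, with <head-pod> as a placeholder:

kubectl exec -it <head-pod> -- wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz
kubectl exec -it <head-pod> -- wget -T 2 -q -O- http://localhost:8265/api/gcs_healthz
# each should print: success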

I saw this error in the operator logs:

{"level":"info","ts":"2024-04-20T02:43:55.060Z","logger":"controllers.RayService","msg":"Check the head Pod status of the pending RayCluster","RayService":{"name":"rayservice-sample","namespace":"rubrik-spark"},"reconcileID":"1372cf45-186c-4c4c-ae4b-a9efe93b5985","RayCluster name":"rayservice-sample-raycluster-ch9j5"}
{"level":"info","ts":"2024-04-20T02:43:55.061Z","logger":"controllers.RayService","msg":"FetchHeadServiceURL","RayService":{"name":"rayservice-sample","namespace":"rubrik-spark"},"reconcileID":"1372cf45-186c-4c4c-ae4b-a9efe93b5985","head service name":"rayservice-sample-raycluster-ch9j5-head-svc","namespace":"rubrik-spark"}
{"level":"info","ts":"2024-04-20T02:43:55.061Z","logger":"controllers.RayService","msg":"FetchHeadServiceURL","RayService":{"name":"rayservice-sample","namespace":"rubrik-spark"},"reconcileID":"1372cf45-186c-4c4c-ae4b-a9efe93b5985","head service URL":"rayservice-sample-raycluster-ch9j5-head-svc.rubrik-spark.svc.cluster.local:8265","port":"dashboard"}
{"level":"info","ts":"2024-04-20T02:43:55.061Z","logger":"controllers.RayService","msg":"shouldUpdate","RayService":{"name":"rayservice-sample","namespace":"rubrik-spark"},"reconcileID":"1372cf45-186c-4c4c-ae4b-a9efe93b5985","shouldUpdateServe":true,"reason":"Nothing has been cached for cluster rayservice-sample-raycluster-ch9j5 with key rubrik-spark/rayservice-sample/rayservice-sample-raycluster-ch9j5"}
{"level":"info","ts":"2024-04-20T02:43:55.061Z","logger":"controllers.RayService","msg":"updateServeDeployment","RayService":{"name":"rayservice-sample","namespace":"rubrik-spark"},"reconcileID":"1372cf45-186c-4c4c-ae4b-a9efe93b5985","V2 config":"applications:\n  - name: math_app\n    import_path: conditional_dag.serve_dag\n    route_prefix: /calc\n    deployments:\n      - name: Adder\n        user_config:\n          increment: 3\n        max_ongoing_requests: 2\n        autoscaling_config:\n          target_ongoing_requests: 1\n          metrics_interval_s: 0.2\n          min_replicas: 0\n          initial_replicas: 0\n          max_replicas: 100\n          look_back_period_s: 2\n          downscale_delay_s: 10\n          upscale_delay_s: 0\n        graceful_shutdown_timeout_s: 5\n        ray_actor_options:\n          num_cpus: 0.5\n      - name: Multiplier\n        user_config:\n          factor: 5\n        max_ongoing_requests: 2\n        autoscaling_config:\n          target_ongoing_requests: 1\n          metrics_interval_s: 0.2\n          min_replicas: 0\n          initial_replicas: 0\n          max_replicas: 100\n          look_back_period_s: 2\n          downscale_delay_s: 10\n          upscale_delay_s: 0\n        graceful_shutdown_timeout_s: 5\n        ray_actor_options:\n          num_cpus: 1.5\n      - name: Router\n        max_ongoing_requests: 2\n        autoscaling_config:\n          target_ongoing_requests: 1\n          metrics_interval_s: 0.2\n          min_replicas: 1\n          max_replicas: 100\n          look_back_period_s: 2\n          downscale_delay_s: 10\n          upscale_delay_s: 0\n        graceful_shutdown_timeout_s: 5\n        ray_actor_options:\n              num_cpus: 1\n"}
{"level":"info","ts":"2024-04-20T02:43:55.062Z","logger":"controllers.RayService","msg":"updateServeDeployment","RayService":{"name":"rayservice-sample","namespace":"rubrik-spark"},"reconcileID":"1372cf45-186c-4c4c-ae4b-a9efe93b5985","MULTI_APP json config":"{\"applications\":[{\"deployments\":[{\"autoscaling_config\":{\"downscale_delay_s\":10,\"initial_replicas\":0,\"look_back_period_s\":2,\"max_replicas\":100,\"metrics_interval_s\":0.2,\"min_replicas\":0,\"target_ongoing_requests\":1,\"upscale_delay_s\":0},\"graceful_shutdown_timeout_s\":5,\"max_ongoing_requests\":2,\"name\":\"Adder\",\"ray_actor_options\":{\"num_cpus\":0.5},\"user_config\":{\"increment\":3}},{\"autoscaling_config\":{\"downscale_delay_s\":10,\"initial_replicas\":0,\"look_back_period_s\":2,\"max_replicas\":100,\"metrics_interval_s\":0.2,\"min_replicas\":0,\"target_ongoing_requests\":1,\"upscale_delay_s\":0},\"graceful_shutdown_timeout_s\":5,\"max_ongoing_requests\":2,\"name\":\"Multiplier\",\"ray_actor_options\":{\"num_cpus\":1.5},\"user_config\":{\"factor\":5}},{\"autoscaling_config\":{\"downscale_delay_s\":10,\"look_back_period_s\":2,\"max_replicas\":100,\"metrics_interval_s\":0.2,\"min_replicas\":1,\"target_ongoing_requests\":1,\"upscale_delay_s\":0},\"graceful_shutdown_timeout_s\":5,\"max_ongoing_requests\":2,\"name\":\"Router\",\"ray_actor_options\":{\"num_cpus\":1}}],\"import_path\":\"conditional_dag.serve_dag\",\"name\":\"math_app\",\"route_prefix\":\"/calc\"}]}"}
{"level":"error","ts":"2024-04-20T02:43:57.077Z","logger":"controllers.RayService","msg":"Fail to reconcileServe.","RayService":{"name":"rayservice-sample","namespace":"rubrik-spark"},"reconcileID":"1372cf45-186c-4c4c-ae4b-a9efe93b5985","error":"Fail to create / update Serve applications. If you observe this error consistently, please check \"Issue 5: Fail to create / update Serve applications.\" in https://docs.ray.io/en/master/cluster/kubernetes/troubleshooting/rayservice-troubleshooting.html#kuberay-raysvc-troubleshoot for more details. err: Put \"http://rayservice-sample-raycluster-ch9j5-head-svc.rubrik-spark.svc.cluster.local:8265/api/serve/applications/\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)","stacktrace":"github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayServiceReconciler).Reconcile\n\t/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/rayservice_controller.go:169\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}

Could you exec into your cluster’s head node and run serve status? What’s the output?

Btw, I’m running this app on GKE.

kubectl get pods -l=ray.io/is-ray-node=yes
NAME                                                      READY   STATUS     RESTARTS   AGE
ervice-sample-raycluster-ch9j5-worker-small-group-m7b4t   0/1     Init:0/1   0          69m
rayservice-sample-raycluster-ch9j5-head-5m57v             1/1     Running    0          2d17h
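
Side note on the worker stuck in Init:0/1: KubeRay injects an init container that waits for the head's GCS, so its state and logs usually explain the wait (the container name below assumes the KubeRay default):

kubectl describe pod ervice-sample-raycluster-ch9j5-worker-small-group-m7b4t
kubectl logs ervice-sample-raycluster-ch9j5-worker-small-group-m7b4t -c wait-gcs-ready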

kubectl exec -it rayservice-sample-raycluster-ch9j5-head-5m57v -- /bin/bash

(base) ray@rayservice-sample-raycluster-ch9j5-head-5m57v:/app$
(base) ray@rayservice-sample-raycluster-ch9j5-head-5m57v:/app$
(base) ray@rayservice-sample-raycluster-ch9j5-head-5m57v:/app$ ls
__pycache__  conditional_dag.py
(base) ray@rayservice-sample-raycluster-ch9j5-head-5m57v:/app$ serve status
proxies: {}
applications: {}
target_capacity: null
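
Another thing worth checking from inside the head pod is whether the import path from serveConfigV2 resolves from the directory Ray starts in; a quick, hypothetical check:

python -c "import conditional_dag; print(conditional_dag.serve_dag)"
# a ModuleNotFoundError here would point at a working-directory / PYTHONPATH problem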

I can’t reproduce the issue. I published my image to Quay at quay.io/kevin85421/ray:test-dag, and you can try it. I use a local Kind cluster.

helm install kuberay-operator kuberay/kuberay-operator  --version 1.1.0

# Create a RayService with https://gist.github.com/kevin85421/fc292ee8b0e24fd9b422833e208dbf51
kubectl apply -f ray-service.sample-no-runtimeEnv.yaml

# Verify
export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
kubectl exec -it $HEAD_POD -- bash

# In the head Pod
serve status
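
# If serve status shows math_app RUNNING, exercise the app end to end.
# The /calc route and payload follow the Ray docs sample; adjust if your DAG differs.
kubectl port-forward svc/rayservice-sample-serve-svc 8000 &
curl -X POST -H 'Content-Type: application/json' localhost:8000/calc/ -d '["MUL", 3]'
# expected, per the docs sample: "15 pizzas please!"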

I tested KubeRay with Ray 2.10.0-py311 and 2.9.0 and hit the same problem: it fails to create rayservice-sample-head-svc and rayservice-sample-serve-svc, though I didn't notice right away. I only noticed because I had gotten it to run once. I then uninstalled everything and ran it again to check reproducibility, and now I can't reproduce a successful launch of the RayService at all.

I tested KubeRay versions 1.1.0 (which doesn't work with Argo CD), 1.1.0-rc.1, and 1.1.0-rc.0.

I also tried deploying my own applications with runtimeEnv. serve start never runs inside the workers. As a workaround I attached a separate node to the KubeRay cluster and ran serve start ... on it; the applications then started deploying, but the services with the standard names still never came up.

@Kai-Hsun_Chen I repeated the steps you mentioned in a local Kind cluster:

$ export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
$ kubectl exec -it $HEAD_POD -- bash
error: unable to upgrade connection: container not found ("ray-head")
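
That error usually just means the ray-head container isn't running yet; checking the pod state first helps:

kubectl get pod $HEAD_POD
kubectl describe pod $HEAD_POD   # inspect container states and events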

Solved via Slack.

I’m having the same issue and don’t have access to Slack. Is it possible to post the fix here?
