Unable to connect to RayService with Ingress after update

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi Community,

  1. I use this file: kuberay/ray_v1alpha1_rayservice.yaml at master · ray-project/kuberay · GitHub to deploy a Ray cluster on a local Kind cluster. I also apply an Ingress, and everything works fine at first, but after I update the config of the Ray head (e.g. the memory resource limit), the Ingress stops working and returns this error: {"message":"failure to get a peer from the ring-balancer"}. Neither the dashboard nor Ray Serve can be accessed. Here is my Ingress yaml:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    konghq.com/strip-path: "true"
  name: example-ingress
spec:
  ingressClassName: kong
  rules:
  - http:
      paths:
      - pathType: Prefix
        path: /serve
        backend:
          service:
            name: rayservice-sample-serve-svc
            port:
              number: 8000
      - pathType: Prefix
        path: /dashboard
        backend:
          service:
            name: rayservice-sample-head-svc
            port:
              number: 8265
  2. I deploy a Ray Serve deployment with FastAPI and can access the Swagger UI under the path /route_prefix/docs with port forwarding, but when I use the same Ingress yaml shown above, the Swagger UI cannot be loaded.

Thanks in advance for your help!

Can you try running through some of the suggestions from here?: kubernetes - Error {"message":"failure to get a peer from the ring-balancer"} using kong ingress - Stack Overflow

In particular, if you’re able to reach the services through port-forwarding, can you try dig raycluster-sample-serve-svc and dig raycluster-complete-head-svc from the ingress pod?
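One other thing worth checking for the Swagger UI problem: the konghq.com/strip-path annotation makes Kong remove the matched Ingress prefix before proxying, so the backend sees a different path than the client requested. If the FastAPI app expects its Serve route_prefix to still be present, the docs page's generated URLs won't match. A minimal illustration of the rewrite (the paths here are illustrative, not from your config):

```shell
# With konghq.com/strip-path: "true", Kong removes the matched Ingress
# prefix before proxying, so the backend sees the remainder of the path.
prefix=/serve
request=/serve/docs
backend_path="${request#"$prefix"}"
echo "$backend_path"   # the backend receives /docs, not /serve/docs
```

If the app is served under a non-root route_prefix, either drop strip-path for that rule or make the Ingress path match the route_prefix exactly.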

Hi @ckw017, I tried these commands and pasted the output below. This time I tried without the Ingress and found I could initially reach the service through port-forwarding. After I made the configuration update, the port-forwarding just got stuck in the terminal and the service could no longer be accessed.

$ curl http://localhost
curl: (7) Failed to connect to localhost port 80: Connection refused

$ dig raycluster-sample-serve-svc

; <<>> DiG 9.16.1-Ubuntu <<>> raycluster-sample-serve-svc
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 41827
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 60229445b803277c (echoed)
;; QUESTION SECTION:
;raycluster-sample-serve-svc.   IN      A

;; Query time: 3192 msec
;; SERVER: 10.96.0.10#53(10.96.0.10)
;; WHEN: Tue Aug 30 23:56:57 PDT 2022
;; MSG SIZE  rcvd: 68

$ dig rayservice-sample-head-svc

; <<>> DiG 9.16.1-Ubuntu <<>> rayservice-sample-head-svc
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 29734
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: b77f7e6077f17a17 (echoed)
;; QUESTION SECTION:
;rayservice-sample-head-svc.    IN      A

;; Query time: 3115 msec
;; SERVER: 10.96.0.10#53(10.96.0.10)
;; WHEN: Tue Aug 30 23:58:29 PDT 2022
;; MSG SIZE  rcvd: 67
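One caveat about the NXDOMAIN results: I ran the short service names from the ingress pod, and a short name only resolves from a pod in the same namespace, so the fully qualified name may behave differently. The naming scheme, assuming the services live in the ray-system namespace as in the describe output below:

```shell
# Kubernetes service DNS follows <service>.<namespace>.svc.cluster.local;
# bare service names only resolve via the search path of pods in the
# same namespace, which would explain NXDOMAIN from the ingress pod.
svc=rayservice-sample-serve-svc
ns=ray-system
echo "${svc}.${ns}.svc.cluster.local"
```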

Here is the result of kubectl describe rayservice rayservice-sample; I hope this gives more insight into the issue:

Name:         rayservice-sample
Namespace:    ray-system
Labels:       <none>
Annotations:  <none>
API Version:  ray.io/v1alpha1
Kind:         RayService
Metadata:
  Creation Timestamp:  2022-08-31T07:04:40Z
  Generation:          2
  Managed Fields:
    API Version:  ray.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
      f:spec:
        .:
        f:deploymentUnhealthySecondThreshold:
        f:rayClusterConfig:
          .:
          f:headGroupSpec:
            .:
            f:rayStartParams:
              .:
              f:block:
              f:dashboard-host:
              f:node-ip-address:
              f:num-cpus:
              f:object-store-memory:
              f:port:
            f:replicas:
            f:serviceType:
            f:template:
              .:
              f:metadata:
                .:
                f:annotations:
                  .:
                  f:key:
                f:labels:
                  .:
                  f:groupName:
                  f:rayCluster:
                  f:rayNodeType:
              f:spec:
                .:
                f:containers:
          f:rayVersion:
          f:workerGroupSpecs:
        f:serveConfig:
          .:
          f:deployments:
          f:importPath:
          f:runtimeEnv:
        f:serviceUnhealthySecondThreshold:
    Manager:      kubectl-client-side-apply
    Operation:    Update
    Time:         2022-08-31T07:04:40Z
    API Version:  ray.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:activeServiceStatus:
          .:
          f:appStatus:
            .:
            f:lastUpdateTime:
            f:status:
          f:dashboardStatus:
            .:
            f:healthLastUpdateTime:
            f:isHealthy:
            f:lastUpdateTime:
          f:rayClusterName:
          f:rayClusterStatus:
            .:
            f:availableWorkerReplicas:
            f:desiredWorkerReplicas:
            f:endpoints:
              .:
              f:client:
              f:dashboard:
              f:dashboard-agent:
              f:gcs-server:
              f:serve:
            f:lastUpdateTime:
            f:maxWorkerReplicas:
            f:minWorkerReplicas:
            f:state:
          f:serveDeploymentStatuses:
        f:pendingServiceStatus:
          .:
          f:appStatus:
          f:dashboardStatus:
          f:rayClusterStatus:
        f:serviceStatus:
    Manager:         manager
    Operation:       Update
    Time:            2022-08-31T07:09:18Z
  Resource Version:  2739428
  UID:               ee8cfb9b-bdbc-490f-b673-f06efcf1324c
Spec:
  Deployment Unhealthy Second Threshold:  300
  Ray Cluster Config:
    Head Group Spec:
      Ray Start Params:
        Block:                    true
        Dashboard - Host:         0.0.0.0
        Node - Ip - Address:      $MY_POD_IP
        Num - Cpus:               0
        Object - Store - Memory:  100000000
        Port:                     6379
      Replicas:                   1
      Service Type:               ClusterIP
      Template:
        Metadata:
          Annotations:
            Key:  value
          Labels:
            Group Name:     headgroup
            Ray Cluster:    raycluster-sample
            Ray Node Type:  head
        Spec:
          Containers:
            Env:
              Name:  MY_POD_IP
              Value From:
                Field Ref:
                  Field Path:   status.podIP
            Image:              rayproject/ray:2.0.0
            Image Pull Policy:  IfNotPresent
            Name:               ray-head
            Ports:
              Container Port:  6379
              Name:            gcs-server
              Protocol:        TCP
              Container Port:  8265
              Name:            dashboard
              Protocol:        TCP
              Container Port:  10001
              Name:            client
              Protocol:        TCP
              Container Port:  8000
              Name:            serve
              Protocol:        TCP
            Resources:
              Limits:
                Cpu:     2
                Memory:  3Gi
              Requests:
                Cpu:     2
                Memory:  3Gi
    Ray Version:         2.0.0
    Worker Group Specs:
      Group Name:    small-group
      Max Replicas:  5
      Min Replicas:  1
      Ray Start Params:
        Block:                true
        Node - Ip - Address:  $MY_POD_IP
      Replicas:               1
      Template:
        Metadata:
          Annotations:
            Key:  value
          Labels:
            Key:  value
        Spec:
          Containers:
            Env:
              Name:   RAY_DISABLE_DOCKER_CPU_WARNING
              Value:  1
              Name:   TYPE
              Value:  worker
              Name:   CPU_REQUEST
              Value From:
                Resource Field Ref:
                  Container Name:  machine-learning
                  Resource:        requests.cpu
              Name:                CPU_LIMITS
              Value From:
                Resource Field Ref:
                  Container Name:  machine-learning
                  Resource:        limits.cpu
              Name:                MEMORY_LIMITS
              Value From:
                Resource Field Ref:
                  Container Name:  machine-learning
                  Resource:        limits.memory
              Name:                MEMORY_REQUESTS
              Value From:
                Resource Field Ref:
                  Container Name:  machine-learning
                  Resource:        requests.memory
              Name:                MY_POD_NAME
              Value From:
                Field Ref:
                  Field Path:  metadata.name
              Name:            MY_POD_IP
              Value From:
                Field Ref:
                  Field Path:   status.podIP
            Image:              rayproject/ray:2.0.0
            Image Pull Policy:  IfNotPresent
            Lifecycle:
              Pre Stop:
                Exec:
                  Command:
                    /bin/sh
                    -c
                    ray stop
            Name:  machine-learning
            Ports:
              Container Port:  80
              Name:            client
              Protocol:        TCP
            Resources:
              Limits:
                Cpu:     1
                Memory:  2Gi
              Requests:
                Cpu:     500m
                Memory:  2Gi
          Init Containers:
            Command:
              sh
              -c
              until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for myservice; sleep 2; done
            Image:  busybox:1.28
            Name:   init-myservice
  Serve Config:
    Deployments:
      Name:          MangoStand
      Num Replicas:  1
      Ray Actor Options:
        Num Cpus:   0.1
      User Config:  price: 3

      Name:          OrangeStand
      Num Replicas:  1
      Ray Actor Options:
        Num Cpus:   0.1
      User Config:  price: 2

      Name:          PearStand
      Num Replicas:  1
      Ray Actor Options:
        Num Cpus:   0.1
      User Config:  price: 1

      Name:          FruitMarket
      Num Replicas:  1
      Ray Actor Options:
        Num Cpus:    0.1
      Name:          DAGDriver
      Num Replicas:  1
      Ray Actor Options:
        Num Cpus:    0.1
      Route Prefix:  /
    Import Path:     fruit.deployment_graph
    Runtime Env:     working_dir: "https://github.com/ray-project/test_dag/archive/c620251044717ace0a4c19d766d43c5099af8a77.zip"

  Service Unhealthy Second Threshold:  300
Status:
  Active Service Status:
    App Status:
      Last Update Time:  2022-08-31T07:12:31Z
      Status:            RUNNING
    Dashboard Status:
      Health Last Update Time:  2022-08-31T07:12:31Z
      Is Healthy:               true
      Last Update Time:         2022-08-31T07:12:31Z
    Ray Cluster Name:           rayservice-sample-raycluster-k2gvf
    Ray Cluster Status:
      Available Worker Replicas:  2
      Desired Worker Replicas:    1
      Endpoints:
        Client:             10001
        Dashboard:          8265
        Dashboard - Agent:  52365
        Gcs - Server:       6379
        Serve:              8000
      Last Update Time:     2022-08-31T07:07:29Z
      Max Worker Replicas:  5
      Min Worker Replicas:  1
      State:                ready
    Serve Deployment Statuses:
      Health Last Update Time:  2022-08-31T07:12:31Z
      Last Update Time:         2022-08-31T07:12:31Z
      Name:                     MangoStand
      Status:                   HEALTHY
      Health Last Update Time:  2022-08-31T07:12:31Z
      Last Update Time:         2022-08-31T07:12:31Z
      Name:                     OrangeStand
      Status:                   HEALTHY
      Health Last Update Time:  2022-08-31T07:12:31Z
      Last Update Time:         2022-08-31T07:12:31Z
      Name:                     PearStand
      Status:                   HEALTHY
      Health Last Update Time:  2022-08-31T07:12:31Z
      Last Update Time:         2022-08-31T07:12:31Z
      Name:                     FruitMarket
      Status:                   HEALTHY
      Health Last Update Time:  2022-08-31T07:12:31Z
      Last Update Time:         2022-08-31T07:12:31Z
      Name:                     DAGDriver
      Status:                   HEALTHY
  Pending Service Status:
    App Status:
    Dashboard Status:
    Ray Cluster Status:
  Service Status:  FailedToUpdateService
Events:
  Type    Reason                       Age                 From                   Message
  ----    ------                       ----                ----                   -------
  Normal  WaitForDashboard             15m (x2 over 15m)   rayservice-controller  Service "rayservice-sample-raycluster-p6gv6-dashboard-svc" not found
  Normal  WaitForServeDeploymentReady  15m (x8 over 15m)   rayservice-controller  Put "http://rayservice-sample-raycluster-p6gv6-dashboard-svc.ray-system.svc.cluster.local:52365/api/serve/deployments/": dial tcp 10.96.113.187:52365: connect: connection refused
  Normal  WaitForServeDeploymentReady  15m (x2 over 15m)   rayservice-controller  Put "http://rayservice-sample-raycluster-p6gv6-dashboard-svc.ray-system.svc.cluster.local:52365/api/serve/deployments/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Normal  SubmittedServeDeployment     15m (x8 over 15m)   rayservice-controller  Controller sent API request to update Serve deployments on cluster rayservice-sample-raycluster-p6gv6
  Normal  FailedToUpdateService        15m (x5 over 15m)   rayservice-controller  Service "rayservice-sample-head-svc" is invalid: [spec.clusterIPs[0]: Invalid value: []string(nil): primary clusterIP can not be unset, spec.ipFamilies[0]: Invalid value: []core.IPFamily(nil): primary ipFamily can not be unset]
  Normal  Running                      10m (x21 over 14m)  rayservice-controller  The Serve applicaton is now running and healthy.

I see, can you try these commands in the head node pod:

wget localhost:8000
wget raycluster-sample-serve-svc:8000

Can you also share what you do to update the config of the Ray head? If you’re reapplying the RayService yaml, the head node might be getting terminated and then restarted. When it restarts, the Serve deployment may no longer be running. You can verify this by running kubectl get pods after you apply the new config to see whether the old pod terminates and is replaced by a new one.

@ckw017 Here is the output of the commands; as you can see, the second one just got stuck:

$ wget localhost:8000
--2022-09-01 07:58:18--  http://localhost:8000/
Resolving localhost (localhost)... ::1, 127.0.0.1
Connecting to localhost (localhost)|::1|:8000... failed: Connection refused.
Connecting to localhost (localhost)|127.0.0.1|:8000... connected.
HTTP request sent, awaiting response... 500 Internal Server Error
2022-09-01 07:58:18 ERROR 500: Internal Server Error.

$ wget rayservice-sample-serve-svc:8000
--2022-09-01 07:58:33--  http://rayservice-sample-serve-svc:8000/
Resolving rayservice-sample-serve-svc (rayservice-sample-serve-svc)... 10.96.126.212
Connecting to rayservice-sample-serve-svc (rayservice-sample-serve-svc)|10.96.126.212|:8000... 

I just changed the memory limits and requests of the Ray head here: kuberay/ray_v1alpha1_rayservice.yaml at ad7843edd282f066bd8bce3a7dee87e19dd52913 · ray-project/kuberay · GitHub

The behaviour you described matches my observation: the pods get updated and the services are still listed by kubectl get svc, but neither the dashboard nor the Serve service is reachable through port-forwarding after the update.
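Concretely, the edit was only to the head group container's resources in the sample yaml, along these lines (a fragment; the values match the kubectl describe output above):

```yaml
# headGroupSpec -> template -> spec -> containers (ray-head) resources;
# editing these and re-applying the yaml recreates the head pod.
resources:
  limits:
    cpu: "2"
    memory: 3Gi
  requests:
    cpu: "2"
    memory: 3Gi
```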

In the end it works on AKS, but I still don’t know why it didn’t work on the local Kind cluster.