Rui
August 30, 2022, 1:56pm
1
How severely does this issue affect your experience of using Ray?
High: It blocks me from completing my task.
Hi Community,
I used this file: kuberay/ray_v1alpha1_rayservice.yaml at master · ray-project/kuberay · GitHub to deploy a Ray cluster on a local Kind cluster. I also applied an Ingress, and everything worked fine. But after I update the config of the Ray head (e.g. the memory resource limit), the Ingress no longer works and returns this error: {"message":"failure to get a peer from the ring-balancer"}. Neither the dashboard nor Ray Serve can be accessed any more. Here is my Ingress yaml:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    konghq.com/strip-path: "true"
  name: example-ingress
spec:
  ingressClassName: kong
  rules:
    - http:
        paths:
          - pathType: Prefix
            path: /serve
            backend:
              service:
                name: rayservice-sample-serve-svc
                port:
                  number: 8000
          - pathType: Prefix
            path: /dashboard
            backend:
              service:
                name: rayservice-sample-head-svc
                port:
                  number: 8265
I deploy the Ray Serve deployment with FastAPI and can access the Swagger UI under the path /route_prefix/docs with port forwarding, but when I use the same Ingress yaml shown above, the Swagger UI cannot be loaded.
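For reference, here is roughly how I do the port forwarding (a sketch; the services are in the ray-system namespace in my setup):

# forward the Serve and dashboard service ports to localhost
$ kubectl port-forward -n ray-system svc/rayservice-sample-serve-svc 8000:8000
$ kubectl port-forward -n ray-system svc/rayservice-sample-head-svc 8265:8265

With these running, Serve is reachable at http://localhost:8000 and the dashboard at http://localhost:8265.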
Thanks in advance for your help!
ckw017
August 30, 2022, 8:55pm
2
Can you try running through some of the suggestions from this thread: kubernetes - Error {"message":"failure to get a peer from the ring-balancer"} using kong ingress - Stack Overflow
In particular, if you’re able to reach the services through port-forwarding, can you try dig raycluster-sample-serve-svc and dig raycluster-complete-head-svc from the ingress pod?
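Something like the following should run the lookup from inside the Kong proxy pod (a sketch; the kong namespace and app=ingress-kong label are assumptions from Kong’s all-in-one manifest, and if dig isn’t available in the image, nslookup may be):

# find the Kong proxy pod (namespace/label are assumptions; adjust to your install)
$ KONG_POD=$(kubectl get pods -n kong -l app=ingress-kong -o jsonpath='{.items[0].metadata.name}')
# short service names only resolve within the same namespace, so use the FQDN
$ kubectl exec -it -n kong "$KONG_POD" -- dig rayservice-sample-serve-svc.ray-system.svc.cluster.local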
Rui
August 31, 2022, 7:25am
3
Hi @ckw017 I tried these commands and pasted the output below. This time I tried without the Ingress and found I could reach the service through port-forwarding at first. After I made the configuration update, the port-forwarding just got stuck in the terminal and the service could not be accessed any more.
$ curl http://localhost
curl: (7) Failed to connect to localhost port 80: Connection refused
$ dig raycluster-sample-serve-svc
; <<>> DiG 9.16.1-Ubuntu <<>> raycluster-sample-serve-svc
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 41827
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 60229445b803277c (echoed)
;; QUESTION SECTION:
;raycluster-sample-serve-svc. IN A
;; Query time: 3192 msec
;; SERVER: 10.96.0.10#53(10.96.0.10)
;; WHEN: Tue Aug 30 23:56:57 PDT 2022
;; MSG SIZE rcvd: 68
$ dig rayservice-sample-head-svc
; <<>> DiG 9.16.1-Ubuntu <<>> rayservice-sample-head-svc
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 29734
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: b77f7e6077f17a17 (echoed)
;; QUESTION SECTION:
;rayservice-sample-head-svc. IN A
;; Query time: 3115 msec
;; SERVER: 10.96.0.10#53(10.96.0.10)
;; WHEN: Tue Aug 30 23:58:29 PDT 2022
;; MSG SIZE rcvd: 67
Here is the result of kubectl describe rayservice rayservice-sample; I hope this gives more insight into the issue:
Name: rayservice-sample
Namespace: ray-system
Labels: <none>
Annotations: <none>
API Version: ray.io/v1alpha1
Kind: RayService
Metadata:
Creation Timestamp: 2022-08-31T07:04:40Z
Generation: 2
Managed Fields:
API Version: ray.io/v1alpha1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
.:
f:kubectl.kubernetes.io/last-applied-configuration:
f:spec:
.:
f:deploymentUnhealthySecondThreshold:
f:rayClusterConfig:
.:
f:headGroupSpec:
.:
f:rayStartParams:
.:
f:block:
f:dashboard-host:
f:node-ip-address:
f:num-cpus:
f:object-store-memory:
f:port:
f:replicas:
f:serviceType:
f:template:
.:
f:metadata:
.:
f:annotations:
.:
f:key:
f:labels:
.:
f:groupName:
f:rayCluster:
f:rayNodeType:
f:spec:
.:
f:containers:
f:rayVersion:
f:workerGroupSpecs:
f:serveConfig:
.:
f:deployments:
f:importPath:
f:runtimeEnv:
f:serviceUnhealthySecondThreshold:
Manager: kubectl-client-side-apply
Operation: Update
Time: 2022-08-31T07:04:40Z
API Version: ray.io/v1alpha1
Fields Type: FieldsV1
fieldsV1:
f:status:
.:
f:activeServiceStatus:
.:
f:appStatus:
.:
f:lastUpdateTime:
f:status:
f:dashboardStatus:
.:
f:healthLastUpdateTime:
f:isHealthy:
f:lastUpdateTime:
f:rayClusterName:
f:rayClusterStatus:
.:
f:availableWorkerReplicas:
f:desiredWorkerReplicas:
f:endpoints:
.:
f:client:
f:dashboard:
f:dashboard-agent:
f:gcs-server:
f:serve:
f:lastUpdateTime:
f:maxWorkerReplicas:
f:minWorkerReplicas:
f:state:
f:serveDeploymentStatuses:
f:pendingServiceStatus:
.:
f:appStatus:
f:dashboardStatus:
f:rayClusterStatus:
f:serviceStatus:
Manager: manager
Operation: Update
Time: 2022-08-31T07:09:18Z
Resource Version: 2739428
UID: ee8cfb9b-bdbc-490f-b673-f06efcf1324c
Spec:
Deployment Unhealthy Second Threshold: 300
Ray Cluster Config:
Head Group Spec:
Ray Start Params:
Block: true
Dashboard - Host: 0.0.0.0
Node - Ip - Address: $MY_POD_IP
Num - Cpus: 0
Object - Store - Memory: 100000000
Port: 6379
Replicas: 1
Service Type: ClusterIP
Template:
Metadata:
Annotations:
Key: value
Labels:
Group Name: headgroup
Ray Cluster: raycluster-sample
Ray Node Type: head
Spec:
Containers:
Env:
Name: MY_POD_IP
Value From:
Field Ref:
Field Path: status.podIP
Image: rayproject/ray:2.0.0
Image Pull Policy: IfNotPresent
Name: ray-head
Ports:
Container Port: 6379
Name: gcs-server
Protocol: TCP
Container Port: 8265
Name: dashboard
Protocol: TCP
Container Port: 10001
Name: client
Protocol: TCP
Container Port: 8000
Name: serve
Protocol: TCP
Resources:
Limits:
Cpu: 2
Memory: 3Gi
Requests:
Cpu: 2
Memory: 3Gi
Ray Version: 2.0.0
Worker Group Specs:
Group Name: small-group
Max Replicas: 5
Min Replicas: 1
Ray Start Params:
Block: true
Node - Ip - Address: $MY_POD_IP
Replicas: 1
Template:
Metadata:
Annotations:
Key: value
Labels:
Key: value
Spec:
Containers:
Env:
Name: RAY_DISABLE_DOCKER_CPU_WARNING
Value: 1
Name: TYPE
Value: worker
Name: CPU_REQUEST
Value From:
Resource Field Ref:
Container Name: machine-learning
Resource: requests.cpu
Name: CPU_LIMITS
Value From:
Resource Field Ref:
Container Name: machine-learning
Resource: limits.cpu
Name: MEMORY_LIMITS
Value From:
Resource Field Ref:
Container Name: machine-learning
Resource: limits.memory
Name: MEMORY_REQUESTS
Value From:
Resource Field Ref:
Container Name: machine-learning
Resource: requests.memory
Name: MY_POD_NAME
Value From:
Field Ref:
Field Path: metadata.name
Name: MY_POD_IP
Value From:
Field Ref:
Field Path: status.podIP
Image: rayproject/ray:2.0.0
Image Pull Policy: IfNotPresent
Lifecycle:
Pre Stop:
Exec:
Command:
/bin/sh
-c
ray stop
Name: machine-learning
Ports:
Container Port: 80
Name: client
Protocol: TCP
Resources:
Limits:
Cpu: 1
Memory: 2Gi
Requests:
Cpu: 500m
Memory: 2Gi
Init Containers:
Command:
sh
-c
until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for myservice; sleep 2; done
Image: busybox:1.28
Name: init-myservice
Serve Config:
Deployments:
Name: MangoStand
Num Replicas: 1
Ray Actor Options:
Num Cpus: 0.1
User Config: price: 3
Name: OrangeStand
Num Replicas: 1
Ray Actor Options:
Num Cpus: 0.1
User Config: price: 2
Name: PearStand
Num Replicas: 1
Ray Actor Options:
Num Cpus: 0.1
User Config: price: 1
Name: FruitMarket
Num Replicas: 1
Ray Actor Options:
Num Cpus: 0.1
Name: DAGDriver
Num Replicas: 1
Ray Actor Options:
Num Cpus: 0.1
Route Prefix: /
Import Path: fruit.deployment_graph
Runtime Env: working_dir: "https://github.com/ray-project/test_dag/archive/c620251044717ace0a4c19d766d43c5099af8a77.zip"
Service Unhealthy Second Threshold: 300
Status:
Active Service Status:
App Status:
Last Update Time: 2022-08-31T07:12:31Z
Status: RUNNING
Dashboard Status:
Health Last Update Time: 2022-08-31T07:12:31Z
Is Healthy: true
Last Update Time: 2022-08-31T07:12:31Z
Ray Cluster Name: rayservice-sample-raycluster-k2gvf
Ray Cluster Status:
Available Worker Replicas: 2
Desired Worker Replicas: 1
Endpoints:
Client: 10001
Dashboard: 8265
Dashboard - Agent: 52365
Gcs - Server: 6379
Serve: 8000
Last Update Time: 2022-08-31T07:07:29Z
Max Worker Replicas: 5
Min Worker Replicas: 1
State: ready
Serve Deployment Statuses:
Health Last Update Time: 2022-08-31T07:12:31Z
Last Update Time: 2022-08-31T07:12:31Z
Name: MangoStand
Status: HEALTHY
Health Last Update Time: 2022-08-31T07:12:31Z
Last Update Time: 2022-08-31T07:12:31Z
Name: OrangeStand
Status: HEALTHY
Health Last Update Time: 2022-08-31T07:12:31Z
Last Update Time: 2022-08-31T07:12:31Z
Name: PearStand
Status: HEALTHY
Health Last Update Time: 2022-08-31T07:12:31Z
Last Update Time: 2022-08-31T07:12:31Z
Name: FruitMarket
Status: HEALTHY
Health Last Update Time: 2022-08-31T07:12:31Z
Last Update Time: 2022-08-31T07:12:31Z
Name: DAGDriver
Status: HEALTHY
Pending Service Status:
App Status:
Dashboard Status:
Ray Cluster Status:
Service Status: FailedToUpdateService
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal WaitForDashboard 15m (x2 over 15m) rayservice-controller Service "rayservice-sample-raycluster-p6gv6-dashboard-svc" not found
Normal WaitForServeDeploymentReady 15m (x8 over 15m) rayservice-controller Put "http://rayservice-sample-raycluster-p6gv6-dashboard-svc.ray-system.svc.cluster.local:52365/api/serve/deployments/": dial tcp 10.96.113.187:52365: connect: connection refused
Normal WaitForServeDeploymentReady 15m (x2 over 15m) rayservice-controller Put "http://rayservice-sample-raycluster-p6gv6-dashboard-svc.ray-system.svc.cluster.local:52365/api/serve/deployments/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Normal SubmittedServeDeployment 15m (x8 over 15m) rayservice-controller Controller sent API request to update Serve deployments on cluster rayservice-sample-raycluster-p6gv6
Normal FailedToUpdateService 15m (x5 over 15m) rayservice-controller Service "rayservice-sample-head-svc" is invalid: [spec.clusterIPs[0]: Invalid value: []string(nil): primary clusterIP can not be unset, spec.ipFamilies[0]: Invalid value: []core.IPFamily(nil): primary ipFamily can not be unset]
Normal Running 10m (x21 over 14m) rayservice-controller The Serve applicaton is now running and healthy.
ckw017
August 31, 2022, 10:47pm
4
I see. Can you try these commands in the head node pod:
wget localhost:8000
wget raycluster-sample-serve-svc:8000
Can you also share what you do to update the config of the Ray head? If you’re reapplying the RayService yaml, the head node might be getting terminated and then restarted. When it restarts, the Serve deployment may no longer be running. You can verify this by running kubectl get pods after you apply the new config to see whether the old pod terminates and gets replaced by a new one.
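For example, something like this (assuming the ray-system namespace from your describe output):

# watch pods while re-applying the yaml; look for the old head pod Terminating
# and a new head pod (possibly under a new raycluster name) appearing
$ kubectl get pods -n ray-system -w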
Rui
September 1, 2022, 3:04pm
5
@ckw017 Here is the output of the commands. As you can see, the second one just hangs:
$ wget localhost:8000
--2022-09-01 07:58:18-- http://localhost:8000/
Resolving localhost (localhost)... ::1, 127.0.0.1
Connecting to localhost (localhost)|::1|:8000... failed: Connection refused.
Connecting to localhost (localhost)|127.0.0.1|:8000... connected.
HTTP request sent, awaiting response... 500 Internal Server Error
2022-09-01 07:58:18 ERROR 500: Internal Server Error.
$ wget rayservice-sample-serve-svc:8000
--2022-09-01 07:58:33-- http://rayservice-sample-serve-svc:8000/
Resolving rayservice-sample-serve-svc (rayservice-sample-serve-svc)... 10.96.126.212
Connecting to rayservice-sample-serve-svc (rayservice-sample-serve-svc)|10.96.126.212|:8000...
I just changed the Ray head memory limits and requests here: kuberay/ray_v1alpha1_rayservice.yaml at ad7843edd282f066bd8bce3a7dee87e19dd52913 · ray-project/kuberay · GitHub
The behaviour you described matches my observation: the pods get updated and the services are still listed when I run kubectl get svc, but neither the dashboard nor the Serve service is reachable through port-forwarding after the update.
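Concretely, the only thing I edit is the resources block of the head group in that yaml, along these lines (a sketch; the structure matches the describe output above, and the memory values shown are placeholders for whatever I change them to):

headGroupSpec:
  template:
    spec:
      containers:
        - name: ray-head
          resources:
            limits:
              cpu: "2"
              memory: 4Gi   # hypothetical new value; my current config uses 3Gi
            requests:
              cpu: "2"
              memory: 4Gi   # hypothetical new value; my current config uses 3Gi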
Rui
September 5, 2022, 3:47pm
6
In the end it works on AKS, but I still don’t know why it didn’t work on the local Kind cluster.
thiyagu
December 26, 2023, 8:21pm
7
Hi, did you ever find a way to get this running on Kind/minikube locally?