Hi Community!
I’m trying to deploy a Serve app using RayService with FT enabled. During a cluster-level update, I expect the ClusterIP Service named after the RayService’s metadata.name to be repointed to the new cluster’s head once the Serve deployment becomes ready. However, it is not: the ClusterIP stays pointed at the old cluster’s head, and the service becomes unavailable once the operator shuts the old cluster down.
Starting a fresh cluster works fine; it is the cluster update that breaks the service. Operator version is 0.5.0.
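For what it’s worth, this is how I watch which RayCluster the stable head Service currently selects during an upgrade (a rough sketch; the ray.io/cluster selector key is taken from the kubectl describe output further below):

# Sketch: which RayCluster does the stable head Service select right now?
kubectl get svc kuberay-cluster-serve-head-svc \
  -o jsonpath='{.spec.selector.ray\.io/cluster}'

In a zero-downtime upgrade I would expect this value to flip from the old RayCluster name to the new one once the new Serve deployment is ready; in my case it never changes.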
Here are some examples:
- RayService definition
apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  annotations:
    ray.io/external-storage-namespace: ray
    ray.io/ft-enabled: "true"
  name: kuberay-cluster-serve
spec:
  serviceUnhealthySecondThreshold: 300
  deploymentUnhealthySecondThreshold: 300
  serveConfig:
    importPath: project.ray.services:deployment_graph
    runtimeEnv: |
      working_dir: ...
      env_vars:
        ...
      pip:
        ...
    deployments:
      ...
  rayClusterConfig:
    headGroupSpec:
      rayStartParams:
        block: "true"
        port: '6379'
        dashboard-host: 0.0.0.0
        metrics-export-port: "9001"
        num-cpus: "0"
      serviceType: ClusterIP
      template:
        metadata:
          annotations: {}
          labels: {}
        spec:
          affinity: {}
          containers:
            - env:
                ...
              image: rayproject/ray:2.4.0-py39-cu113
              imagePullPolicy: IfNotPresent
              lifecycle:
                preStop:
                  exec:
                    command:
                      - /bin/sh
                      - -c
                      - ray stop
              name: ray-head
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 10001
                  name: client
                  protocol: TCP
                - containerPort: 8265
                  name: dashboard
                  protocol: TCP
                - containerPort: 8000
                  name: ray-serve
                  protocol: TCP
                - containerPort: 52365
                  name: dashboard-agent
                  protocol: TCP
                - containerPort: 9001
                  name: http-metrics
                  protocol: TCP
              resources:
                limits:
                  cpu: "4"
                  ephemeral-storage: 5000M
                  memory: 10Gi
                requests:
                  cpu: "4"
                  ephemeral-storage: 5000M
                  memory: 10Gi
              securityContext: {}
              volumeMounts:
                - mountPath: /tmp/ray
                  name: log-volume
          imagePullSecrets: []
          nodeSelector: {}
          tolerations: []
          volumes:
            - emptyDir: {}
              name: log-volume
    workerGroupSpecs:
      - groupName: workergroup-models
        maxReplicas: 1
        minReplicas: 1
        rayStartParams:
          block: "true"
          metrics-export-port: "9001"
        replicas: 1
        template:
          metadata:
            annotations: {}
            labels: {}
          spec:
            affinity: {}
            containers:
              - env:
                  ...
                image: rayproject/ray:2.4.0-py39-cu113
                imagePullPolicy: IfNotPresent
                lifecycle:
                  preStop:
                    exec:
                      command:
                        - /bin/sh
                        - -c
                        - ray stop
                name: ray-worker
                ports:
                  - containerPort: 9001
                    name: http-metrics
                resources:
                  limits:
                    ...
                  requests:
                    ...
                securityContext: {}
                volumeMounts:
                  - mountPath: /tmp/ray
                    name: log-volume
            imagePullSecrets: []
            nodeSelector: {}
            tolerations: []
            volumes:
              - emptyDir: {}
                name: log-volume
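To check which cluster the operator itself considers active, I compare the RayService status with the Service selector (a sketch; I’m assuming the status fields activeServiceStatus/pendingServiceStatus exist in this shape in operator 0.5.0):

# Sketch: which RayCluster does the operator consider active vs. pending?
# (assumes the 0.5.0 status layout with activeServiceStatus/pendingServiceStatus)
kubectl get rayservice kuberay-cluster-serve \
  -o jsonpath='{.status.activeServiceStatus.rayClusterName}'
kubectl get rayservice kuberay-cluster-serve \
  -o jsonpath='{.status.pendingServiceStatus.rayClusterName}'

If the selector were being reconciled correctly, the active cluster name reported here should match the ray.io/cluster value in the head Service selector.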
- Services:
kubectl get svc
NAME                                                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                                            AGE
kuberay-cluster-serve-raycluster-km7j7-dashboard-svc   ClusterIP   10.100.100.100   <none>        52365/TCP                                                          23h
kuberay-cluster-serve-head-svc                         ClusterIP   10.100.100.101   <none>        6379/TCP,10001/TCP,8265/TCP,8000/TCP,52365/TCP,9001/TCP,8080/TCP   23h
kuberay-cluster-serve-raycluster-km7j7-head-svc        ClusterIP   10.100.100.102   <none>        9001/TCP,8080/TCP,6379/TCP,10001/TCP,8265/TCP,8000/TCP,52365/TCP   23h
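To confirm the mismatch, I also compare the Service endpoints with the head pods that actually exist (the describe output below shows the selector itself):

# Endpoints of the stable head Service vs. the head pods that are Running
kubectl get endpoints kuberay-cluster-serve-head-svc
kubectl get pods -l ray.io/node-type=head -L ray.io/cluster

The endpoints come back empty even though a head pod from the new cluster is Running, which matches the Endpoints: <none> lines below.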
kubectl describe svc kuberay-cluster-serve-head-svc
Name: kuberay-cluster-serve-head-svc
Namespace: recommender-v2
Labels: ray.io/identifier=kuberay-cluster-serve-head
ray.io/node-type=head
ray.io/service=kuberay-cluster-serve
Annotations: <none>
Selector: app.kubernetes.io/created-by=kuberay-operator,app.kubernetes.io/name=kuberay,ray.io/cluster=kuberay-cluster-serve-raycluster-km7j7,ray.io/identifier=kuberay-cluster-serve-raycluster-km7j7-head,ray.io/node-type=head
Type: ClusterIP
IP Families: <none>
IP: 10.100.100.101
IPs: 10.100.100.101
Port: gcs-server 6379/TCP
TargetPort: 6379/TCP
Endpoints: <none>
Port: client 10001/TCP
TargetPort: 10001/TCP
Endpoints: <none>
Port: dashboard 8265/TCP
TargetPort: 8265/TCP
Endpoints: <none>
Port: ray-serve 8000/TCP
TargetPort: 8000/TCP
Endpoints: <none>
Port: dashboard-agent 52365/TCP
TargetPort: 52365/TCP
Endpoints: <none>
Port: http-metrics 9001/TCP
TargetPort: 9001/TCP
Endpoints: <none>
Port: metrics 8080/TCP
TargetPort: 8080/TCP
Endpoints: <none>
Session Affinity: None
Events: <none>
The problem: after the cluster update, the new cluster starts successfully, but kuberay-cluster-serve-head-svc does not point to it and retains the old cluster’s selectors (hence the Endpoints: <none> above).
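As a stopgap, manually repointing the selector at the new cluster should restore traffic (a hedged sketch, not anything the operator documents; <new-id> is a hypothetical placeholder for the new RayCluster’s suffix from kubectl get raycluster):

# Hypothetical stopgap: repoint the selector to the new cluster by hand.
# <new-id> is a placeholder for the new RayCluster suffix;
# a merge patch only overwrites the selector keys listed here.
kubectl patch svc kuberay-cluster-serve-head-svc --type merge -p '
{"spec": {"selector": {
  "ray.io/cluster": "kuberay-cluster-serve-raycluster-<new-id>",
  "ray.io/identifier": "kuberay-cluster-serve-raycluster-<new-id>-head"
}}}'

But of course the operator should be doing this itself on upgrade.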
Has anyone experienced this?
How severely does this issue affect your experience of using Ray?
- High: It blocks me to complete my task.