KubeRay RayService CR does not update the ClusterIP svc

Hi Community!

I’m trying to deploy a Serve app using a RayService with FT enabled. However, on cluster-level updates I expect the ClusterIP svc named after the RayService’s metadata.name to be repointed to the new cluster head once the Serve deployment becomes ready. Instead it stays pointed at the old cluster head, and the new cluster becomes unavailable once the operator shuts the old one down.

If I just start a fresh cluster everything is fine, but a cluster update breaks the service. The operator version is 0.5.0.
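
For context, this is roughly how I observe it (the names match the manifests below; ray.io/cluster is the selector label visible in the describe output further down):

# Which RayCluster does the stable head service currently select?
kubectl get svc kuberay-cluster-serve-head-svc \
  -o jsonpath='{.spec.selector.ray\.io/cluster}{"\n"}'

# Which RayCluster is the new head pod actually labeled with?
kubectl get pods -l ray.io/node-type=head -L ray.io/cluster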

Here are some examples:

  • RayService definition
apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  annotations:
    ray.io/external-storage-namespace: ray
    ray.io/ft-enabled: "true"
  name: kuberay-cluster-serve
spec:
  serviceUnhealthySecondThreshold: 300
  deploymentUnhealthySecondThreshold: 300
  serveConfig: 
    importPath: project.ray.services:deployment_graph
    runtimeEnv: |
      working_dir: ...
      env_vars:
        ...
      pip:
        ...
    deployments:
      ...

  rayClusterConfig:
    headGroupSpec:
      rayStartParams:
        block: "true"
        port: '6379'
        dashboard-host: 0.0.0.0
        metrics-export-port: "9001"
        num-cpus: "0"
      serviceType: ClusterIP
      template:
        metadata:
          annotations: {}
          labels: {}
        spec:
          affinity: {}
          containers:
            - env:
                ...
              image: rayproject/ray:2.4.0-py39-cu113
              imagePullPolicy: IfNotPresent
              lifecycle:
                preStop:
                  exec:
                    command:
                      - /bin/sh
                      - -c
                      - ray stop
              name: ray-head
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 10001
                  name: client
                  protocol: TCP
                - containerPort: 8265
                  name: dashboard
                  protocol: TCP
                - containerPort: 8000
                  name: ray-serve
                  protocol: TCP
                - containerPort: 52365
                  name: dashboard-agent
                  protocol: TCP
                - containerPort: 9001
                  name: http-metrics
                  protocol: TCP
              resources:
                limits:
                  cpu: "4"
                  ephemeral-storage: 5000M
                  memory: 10Gi
                requests:
                  cpu: "4"
                  ephemeral-storage: 5000M
                  memory: 10Gi
              securityContext: {}
              volumeMounts:
              - mountPath: /tmp/ray
                name: log-volume
          imagePullSecrets: []
          nodeSelector: {}
          tolerations: []
          volumes:
          - emptyDir: {}
            name: log-volume
    workerGroupSpecs:
      - groupName: workergroup-models
        maxReplicas: 1
        minReplicas: 1
        rayStartParams:
          block: "true"
          metrics-export-port: "9001"
        replicas: 1
        template:
          metadata:
            annotations: {}
            labels: {}
          spec:
            affinity: {}
            containers:
              - env:
                  ...
                image: rayproject/ray:2.4.0-py39-cu113
                imagePullPolicy: IfNotPresent
                lifecycle:
                  preStop:
                    exec:
                      command:
                        - /bin/sh
                        - -c
                        - ray stop
                name: ray-worker
                ports:
                  - containerPort: 9001
                    name: http-metrics
                resources:
                  limits:
                    ...
                  requests:
                    ...
                securityContext: {}
                volumeMounts:
                - mountPath: /tmp/ray
                  name: log-volume
            imagePullSecrets: []
            nodeSelector: {}
            tolerations: []
            volumes:
            - emptyDir: {}
              name: log-volume
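
The breakage shows up after any cluster-level change (for example, bumping the image tag anywhere under rayClusterConfig), which, as far as I understand, makes the operator roll out a brand-new RayCluster behind the same stable head service:

# Hypothetical filename for the manifest above; any rayClusterConfig change
# triggers a new RayCluster that the head service should then follow:
kubectl apply -f rayservice.yaml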

Services:

kubectl get svc
NAME                                                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                                            AGE
kuberay-cluster-serve-raycluster-km7j7-dashboard-svc   ClusterIP   10.100.100.100   <none>        52365/TCP                                                          23h
kuberay-cluster-serve-head-svc                         ClusterIP   10.100.100.101   <none>        6379/TCP,10001/TCP,8265/TCP,8000/TCP,52365/TCP,9001/TCP,8080/TCP   23h
kuberay-cluster-serve-raycluster-km7j7-head-svc        ClusterIP   10.100.100.102   <none>        9001/TCP,8080/TCP,6379/TCP,10001/TCP,8265/TCP,8000/TCP,52365/TCP   23h

kubectl describe svc kuberay-cluster-serve-head-svc
Name:              kuberay-cluster-serve-head-svc
Namespace:         recommender-v2
Labels:            ray.io/identifier=kuberay-cluster-serve-head
                   ray.io/node-type=head
                   ray.io/service=kuberay-cluster-serve
Annotations:       <none>
Selector:          app.kubernetes.io/created-by=kuberay-operator,app.kubernetes.io/name=kuberay,ray.io/cluster=kuberay-cluster-serve-raycluster-km7j7,ray.io/identifier=kuberay-cluster-serve-raycluster-km7j7-head,ray.io/node-type=head
Type:              ClusterIP
IP Families:       <none>
IP:                10.100.100.101
IPs:               10.100.100.101
Port:              gcs-server  6379/TCP
TargetPort:        6379/TCP
Endpoints:         <none>
Port:              client  10001/TCP
TargetPort:        10001/TCP
Endpoints:         <none>
Port:              dashboard  8265/TCP
TargetPort:        8265/TCP
Endpoints:         <none>
Port:              ray-serve  8000/TCP
TargetPort:        8000/TCP
Endpoints:         <none>
Port:              dashboard-agent  52365/TCP
TargetPort:        52365/TCP
Endpoints:         <none>
Port:              http-metrics  9001/TCP
TargetPort:        9001/TCP
Endpoints:         <none>
Port:              metrics  8080/TCP
TargetPort:        8080/TCP
Endpoints:         <none>
Session Affinity:  None
Events:            <none>

The problem is that after the cluster update the new cluster starts successfully, but kuberay-cluster-serve-head-svc never repoints to it and keeps the old selectors, which is why all its Endpoints above are <none>.
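
As a temporary workaround I repoint the service by hand (just a sketch: the cluster name below is hypothetical, take the real one from `kubectl get rayclusters`, and the operator may well overwrite the patch on its next reconcile):

# Hypothetical name of the freshly created RayCluster:
NEW_CLUSTER=kuberay-cluster-serve-raycluster-abcde
kubectl patch svc kuberay-cluster-serve-head-svc --type merge -p "{
  \"spec\": {\"selector\": {
    \"ray.io/cluster\": \"${NEW_CLUSTER}\",
    \"ray.io/identifier\": \"${NEW_CLUSTER}-head\"
  }}}"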

Has anyone experienced this?

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.


@Kai-Hsun_Chen any response or ideas?


Hi @jamm1985,

Would you mind opening an issue in the KubeRay repository? I cannot reproduce the issue with the following steps:

# Step 0: Prepare a Kubernetes cluster
kind create cluster --image=kindest/node:v1.23.0

# Step 1: Install a KubeRay operator
helm install kuberay-operator kuberay/kuberay-operator --version 0.5.0

# Step 2: Create a RayService
# path: ray-operator/config/samples
kubectl apply -f ray_v1alpha1_rayservice.yaml

# Step 3: Edit `spec.rayClusterConfig.rayVersion` from 2.4.0 to 2.100.0.
kubectl edit rayservices.ray.io rayservice-sample

# Step 4: Wait for the Serve deployments on the new RayCluster to become ready.
# Check the service's selector
kubectl describe svc rayservice-sample-head-svc

Hi @Kai-Hsun_Chen:

Thanks for the quick response!

While I’m working on a reproducible snippet, maybe this error will give you some ideas:

2023-05-23T10:33:44.571Z	ERROR	controllers.RayService	raySvc Update error!	{"raySvc.Error": "Service \"kuberay-cluster-train-head-svc\" is invalid: spec.clusterIPs[0]: Invalid value: []string(nil): primary clusterIP can not be unset", "error": "Service \"kuberay-cluster-train-head-svc\" is invalid: spec.clusterIPs[0]: Invalid value: []string(nil): primary clusterIP can not be unset"}

This is from the operator pod, running under k8s v1.20.7.

The service actually does have a clusterIPs section:

apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2023-05-21T14:06:19Z"
  labels:
    ray.io/identifier: kuberay-cluster-serve-head
    ray.io/node-type: head
    ray.io/service: kuberay-cluster-serve
  name: kuberay-cluster-serve-head-svc
  namespace: project
  ownerReferences:
  - apiVersion: ray.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: RayService
    name: kuberay-cluster-serve
    uid: f920cfe2-c031-4fce-ab0c-50ef57d7d217
  resourceVersion: "598159003"
  uid: a60c6ec7-27dc-4623-9f9a-2e091330c838
spec:
  clusterIP: 10.105.159.150
  clusterIPs:
  - 10.105.159.150
  ports:
  - appProtocol: tcp
    name: gcs-server
    port: 6379
    protocol: TCP
    targetPort: 6379
  - appProtocol: tcp
    name: client
    port: 10001
    protocol: TCP
    targetPort: 10001
  - appProtocol: tcp
    name: dashboard
    port: 8265
    protocol: TCP
    targetPort: 8265
  - appProtocol: tcp
    name: ray-serve
    port: 8000
    protocol: TCP
    targetPort: 8000
  - appProtocol: tcp
    name: dashboard-agent
    port: 52365
    protocol: TCP
    targetPort: 52365
  - appProtocol: tcp
    name: http-metrics
    port: 9001
    protocol: TCP
    targetPort: 9001
  - appProtocol: tcp
    name: metrics
    port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    app.kubernetes.io/created-by: kuberay-operator
    app.kubernetes.io/name: kuberay
    ray.io/cluster: kuberay-cluster-serve-raycluster-km7j7
    ray.io/identifier: kuberay-cluster-serve-raycluster-km7j7-head
    ray.io/node-type: head
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
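
If I read that error right, my guess is that the operator builds a fresh Service spec for the new cluster and calls Update without carrying over spec.clusterIP / spec.clusterIPs from the live object, which a v1.20 apiserver rejects, so the selector never switches. Roughly what I think it runs into, reproduced by hand:

kubectl get svc kuberay-cluster-serve-head-svc -o yaml > head-svc.yaml
# ... delete spec.clusterIP and spec.clusterIPs from head-svc.yaml by hand ...
kubectl replace -f head-svc.yaml
# On v1.20.7 I would expect the same rejection the operator logs:
#   spec.clusterIPs[0]: Invalid value: []string(nil): primary clusterIP can not be unset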

@Kai-Hsun_Chen

I figured out that this affects k8s v1.20.7 and not v1.23.
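
For anyone else hitting this, it’s worth checking the apiserver version of each cluster first; in my case only the v1.20.7 clusters were affected:

# Quick check of client and server versions on each cluster:
kubectl version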