Ray Serve replica-level autoscaling not working with Kubernetes deployment

High: Completely blocks me.

2. Environment:

  • Ray version: rayproject/ray:2.41.0
  • Python version:
  • OS: ubuntu
  • Cloud/Infrastructure: aws
  • Other libs/tools (if relevant):

3. What happened vs. what you expected:

  • Expected: As the load increases, autoscaling should kick in and the number of replicas should increase from the minimum of 1 to the maximum of 10.

  • Actual: No scaling up of replicas happens, even when I send a large amount of traffic.

I am deploying multiplexed replicas:

ray_inference.py

from ray import serve
from starlette.requests import Request

# MODEL_DIR is defined elsewhere in my code (path to the model artifacts).


@serve.deployment(
    name="inference",
    ray_actor_options={"num_cpus": 0.2,
                       "num_gpus": 0.1})
class Inference:
    def __init__(self):
        self.model_path = MODEL_DIR

    @serve.multiplexed(max_num_models_per_replica=2)
    async def get_model_runtime(self, model_id: str):
        # Loading the model identified by model_id from self.model_path
        ...

    async def __call__(self, request: Request):
        # The model id comes from the serve_multiplexed_model_id request header
        model_id = serve.get_multiplexed_model_id()
        model = await self.get_model_runtime(model_id)
        # Running inference and returning the response
        ...
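For reference, this is roughly how I generate load during testing. The endpoint URL and model IDs below are placeholders rather than my real values; the serve_multiplexed_model_id header is what Ray Serve uses to route a request to the right multiplexed model (a simplified sketch of my client, not the exact script):

import concurrent.futures

import requests

# Placeholders: the real client uses the cluster's serve endpoint and real model ids.
URL = "http://<serve-endpoint>:8000/ray_inference"
MODEL_IDS = ["model_a", "model_b", "model_c"]


def send(i: int) -> int:
    # Ray Serve picks the multiplexed model based on this header.
    headers = {"serve_multiplexed_model_id": MODEL_IDS[i % len(MODEL_IDS)]}
    resp = requests.post(URL, json={"payload": f"sample-{i}"}, headers=headers)
    return resp.status_code


# Fire a burst of concurrent requests so ongoing requests exceed the autoscaling target.
with concurrent.futures.ThreadPoolExecutor(max_workers=64) as pool:
    codes = list(pool.map(send, range(500)))
print({code: codes.count(code) for code in set(codes)})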

Kubernetes YAML file (Helm template) used for the RayService:

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: inference
spec:
  serviceUnhealthySecondThreshold: 300
  deploymentUnhealthySecondThreshold: 300
  serveConfigV2: |
    applications:
      - name: inference_worker
        route_prefix: /ray_inference
        deployments:
          - name: inference
            num_replicas: auto
            max_ongoing_requests: 40
            ray_actor_options:
              num_cpus: 0.1
              num_gpus: 0.1
            autoscaling_config:
              min_replicas: 1
              max_replicas: 10
              target_ongoing_requests: 30

  rayClusterConfig:
    rayVersion: '2.41.0' # Should match the Ray version in the image of the containers
    ######################headGroupSpecs#################################
    # Ray head pod template.
    headGroupSpec:
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams: {}
      # Pod template
      template:
        spec:
          containers:
          - name: ray-head
            image:  <custom image imported from ray>
            ports:
            - containerPort: 6379
              name: gcs
            - containerPort: 8265
              name: dashboard
            - containerPort: 10001
              name: client
            - containerPort: {{ .Values.rayservice.containerPort }}
              name: serve
            volumeMounts:
              - mountPath: /tmp/ray
                name: ray-logs
            resources:
              limits:
                cpu: 2
                memory: "4G"
              requests:
                cpu: 2
                memory: "4G"
          volumes:
            - name: ray-logs
              emptyDir: {}
    workerGroupSpecs:
    # The pod replicas in this group typed worker
    - replicas: 1
      minReplicas: 1
      maxReplicas: 2
      groupName: wgpu
      rayStartParams: {}
      # Pod template
      template:
        spec:            
          containers:
          - name: ray-worker
            image: <custom image imported from ray>
            lifecycle:
              preStop:
                exec:
                  command: ["/bin/sh","-c","ray stop"]
            resources:
              limits:
                cpu: 4
                memory: "24G"
                nvidia.com/gpu: 1
              requests:
                cpu: 4
                memory: "24G"
                nvidia.com/gpu: 1
            readinessProbe:
              exec:
                command:
                  - bash
                  - -c
                  - wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success
              initialDelaySeconds: 10
              timeoutSeconds: 2
              periodSeconds: 5
              failureThreshold: 5
            livenessProbe:
              exec:
                command:
                  - bash
                  - -c
                  - wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success
              initialDelaySeconds: 30
              timeoutSeconds: 2
              periodSeconds: 5
              failureThreshold: 5

          # Please add the following taints to the GPU node.
          tolerations:
            - key: "ray.io/node-type"
              operator: "Equal"
              value: "worker"
              effect: "NoSchedule"
            - key: "nvidia.com/gpu"
              operator: Exists
            - key: "pool"
              value: "gpu"
              operator: "Equal"
              effect: "NoSchedule"
            - key: "type"
              value: "{{ .Values.rayinference.gputype | default "a10g" }}"
              operator: "Equal"
              effect: "NoSchedule"
          nodeSelector:
            "pool": "gpu"
            "type": {{ .Values.rayinference.gputype | default "a10g" }}
          volumes:
          - name: {{ .Chart.Name }}-global
            configMap:
              name: {{ .Release.Name }}-configmap

I have tried different values for max_ongoing_requests and target_ongoing_requests (with the target less than max_ongoing_requests), but the behaviour has been the same.
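For context, my rough mental model of the replica autoscaling math (my own approximation, not the exact controller internals): the number of replicas should trend towards the total number of ongoing requests divided by target_ongoing_requests, clamped between min_replicas and max_replicas:

import math

# Values from my autoscaling_config.
target_ongoing_requests = 30
min_replicas, max_replicas = 1, 10


def expected_replicas(total_ongoing_requests: int) -> int:
    # Approximation: replicas ~= ceil(total ongoing requests / target per replica),
    # clamped to the configured bounds.
    desired = math.ceil(total_ongoing_requests / target_ongoing_requests)
    return max(min_replicas, min(max_replicas, desired))


for load in (20, 60, 150, 400):
    print(load, "->", expected_replicas(load))  # 20 -> 1, 60 -> 2, 150 -> 5, 400 -> 10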

As seen in the Ray dashboard screenshot, there are warnings claiming that resources are not available, but the resource status says otherwise.

Found this output in the monitor.log file:

Active:
1 node_684be43d1c1656e01e5c57c5f2f0bcd744d354d7f6183c18722de3de
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
Usage:
0.0/2.0 CPU
0B/3.73GiB memory
0B/986.91MiB object_store_memory
Demands:
{'CPU': 0.1, 'GPU': 0.1}: 1+ pending tasks/actors
2025-06-11 06:52:17,312 INFO autoscaler.py:461 -- The autoscaler took 0.001 seconds to complete the update iteration.
2025-06-11 06:52:22,316 INFO autoscaler.py:146 -- The autoscaler took 0.0 seconds to fetch the list of non-terminated nodes.
2025-06-11 06:52:22,316 INFO autoscaler.py:418 --
======== Autoscaler status: 2025-06-11 06:52:22.316590 ========
Node status
Active:
1 node_684be43d1c1656e01e5c57c5f2f0bcd744d354d7f6183c18722de3de
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
Usage:
0.0/2.0 CPU
0B/3.73GiB memory
0B/986.91MiB object_store_memory

Should enableInTreeAutoscaling: true be set even for replica scaling inside a worker pod? Do I have to set any config on the KubeRay operator to enable this?
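For reference, my understanding from the KubeRay docs is that this flag lives under rayClusterConfig in the RayService spec; an untested sketch of where it would go (not something I have verified yet):

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: inference
spec:
  serveConfigV2: |
    # ... unchanged ...
  rayClusterConfig:
    # Enables the Ray autoscaler sidecar on the head pod, so that pending
    # resource demand (e.g. new Serve replicas) can trigger new worker pods.
    enableInTreeAutoscaling: true
    # autoscalerOptions:        # optional tuning of the autoscaler sidecar
    #   idleTimeoutSeconds: 60
    rayVersion: '2.41.0'
    # ... headGroupSpec / workerGroupSpecs unchanged ...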

The default values of both of the parameters below were too high to trigger autoscaling of replicas inside the pod during my tests; setting them to the values below fixed my issue with replica scaling:

metrics_interval_s: 2
upscale_delay_s: 2
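For completeness, this is roughly how the autoscaling_config block in my serveConfigV2 looks after the change (same values as above, shown in place; the defaults noted in the comments are what I believe they are in Ray 2.41):

autoscaling_config:
  min_replicas: 1
  max_replicas: 10
  target_ongoing_requests: 30
  metrics_interval_s: 2   # default is 10s: how often replicas report metrics to the controller
  upscale_delay_s: 2      # default is 30s: how long demand must persist before scaling up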

Worker pod scaling and node availability turned out to be a different issue altogether, related to pod availability in the cluster I think; I am still looking into it.