Ray Serve replica-level autoscaling not working with Kubernetes deployment

High: Completely blocks me.

2. Environment:

  • Ray version: rayproject/ray:2.41.0
  • Python version:
  • OS: ubuntu
  • Cloud/Infrastructure: aws
  • Other libs/tools (if relevant):

3. What happened vs. what you expected:

  • Expected: As the load increases, autoscaling should kick in and the number of replicas should increase from the minimum of 1 to the maximum of 10.

  • Actual: No scaling up of replicas happens, even when I send a large amount of traffic.

I am deploying multiplexed replicas:

ray_inference.py

from ray import serve
from starlette.requests import Request

# MODEL_DIR is defined elsewhere in my code (path to the model artifacts).


@serve.deployment(
    name="inference",
    ray_actor_options={"num_cpus": 0.2,
                       "num_gpus": 0.1})
class Inference:
    def __init__(self):
        self.model_path = MODEL_DIR

    @serve.multiplexed(max_num_models_per_replica=2)
    async def get_model_runtime(self, model_id: str):
        # Loading the model identified by model_id from self.model_path
        ...

    async def __call__(self, request: Request):
        # The model id comes from the serve_multiplexed_model_id request header
        model_id = serve.get_multiplexed_model_id()
        model = await self.get_model_runtime(model_id)
        # Running inference and returning the response
        ...
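For reference, this is roughly how I generate load during testing. The endpoint URL and model IDs below are placeholders rather than my real values; the serve_multiplexed_model_id header is what Ray Serve uses to route a request to the right multiplexed model (a simplified sketch of my client, not the exact script):

import concurrent.futures

import requests

# Placeholders: the real client uses the cluster's serve endpoint and real model ids.
URL = "http://<serve-endpoint>:8000/ray_inference"
MODEL_IDS = ["model_a", "model_b", "model_c"]


def send(i: int) -> int:
    # Ray Serve picks the multiplexed model based on this header.
    headers = {"serve_multiplexed_model_id": MODEL_IDS[i % len(MODEL_IDS)]}
    resp = requests.post(URL, json={"payload": f"sample-{i}"}, headers=headers)
    return resp.status_code


# Fire a burst of concurrent requests so ongoing requests exceed the autoscaling target.
with concurrent.futures.ThreadPoolExecutor(max_workers=64) as pool:
    codes = list(pool.map(send, range(500)))
print({code: codes.count(code) for code in set(codes)})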

Kubernetes YAML file (Helm template) used for the RayService:

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: inference
spec:
  serviceUnhealthySecondThreshold: 300
  deploymentUnhealthySecondThreshold: 300
  serveConfigV2: |
    applications:
      - name: inference_worker
        route_prefix: /ray_inference
        deployments:
          - name: inference
            num_replicas: auto
            max_ongoing_requests: 40
            ray_actor_options:
              num_cpus: 0.1
              num_gpus: 0.1
            autoscaling_config:
              min_replicas: 1
              max_replicas: 10
              target_ongoing_requests: 30

  rayClusterConfig:
    rayVersion: '2.41.0' # Should match the Ray version in the image of the containers
    ######################headGroupSpecs#################################
    # Ray head pod template.
    headGroupSpec:
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams: {}
      # Pod template
      template:
        spec:
          containers:
          - name: ray-head
            image:  <custom image imported from ray>
            ports:
            - containerPort: 6379
              name: gcs
            - containerPort: 8265
              name: dashboard
            - containerPort: 10001
              name: client
            - containerPort: {{ .Values.rayservice.containerPort }}
              name: serve
            volumeMounts:
              - mountPath: /tmp/ray
                name: ray-logs
            resources:
              limits:
                cpu: 2
                memory: "4G"
              requests:
                cpu: 2
                memory: "4G"
          volumes:
            - name: ray-logs
              emptyDir: {}
    workerGroupSpecs:
    # The pod replicas in this group typed worker
    - replicas: 1
      minReplicas: 1
      maxReplicas: 2
      groupName: wgpu
      rayStartParams: {}
      # Pod template
      template:
        spec:            
          containers:
          - name: ray-worker
            image: <custom image imported from ray>
            lifecycle:
              preStop:
                exec:
                  command: ["/bin/sh","-c","ray stop"]
            resources:
              limits:
                cpu: 4
                memory: "24G"
                nvidia.com/gpu: 1
              requests:
                cpu: 4
                memory: "24G"
                nvidia.com/gpu: 1
            readinessProbe:
              exec:
                command:
                  - bash
                  - -c
                  - wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success
              initialDelaySeconds: 10
              timeoutSeconds: 2
              periodSeconds: 5
              failureThreshold: 5
            livenessProbe:
              exec:
                command:
                  - bash
                  - -c
                  - wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success
              initialDelaySeconds: 30
              timeoutSeconds: 2
              periodSeconds: 5
              failureThreshold: 5

          # Please add the following taints to the GPU node.
          tolerations:
            - key: "ray.io/node-type"
              operator: "Equal"
              value: "worker"
              effect: "NoSchedule"
            - key: "nvidia.com/gpu"
              operator: Exists
            - key: "pool"
              value: "gpu"
              operator: "Equal"
              effect: "NoSchedule"
            - key: "type"
              value: "{{ .Values.rayinference.gputype | default "a10g" }}"
              operator: "Equal"
              effect: "NoSchedule"
          nodeSelector:
            "pool": "gpu"
            "type": {{ .Values.rayinference.gputype | default "a10g" }}
          volumes:
          - name: {{ .Chart.Name }}-global
            configMap:
              name: {{ .Release.Name }}-configmap

I have tried different values for max_ongoing_requests and target_ongoing_requests (with the target less than max_ongoing_requests), but the behaviour has been the same.
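For context, my rough mental model of the replica autoscaling math (my own approximation, not the exact controller internals): the number of replicas should trend towards the total number of ongoing requests divided by target_ongoing_requests, clamped between min_replicas and max_replicas:

import math

# Values from my autoscaling_config.
target_ongoing_requests = 30
min_replicas, max_replicas = 1, 10


def expected_replicas(total_ongoing_requests: int) -> int:
    # Approximation: replicas ~= ceil(total ongoing requests / target per replica),
    # clamped to the configured bounds.
    desired = math.ceil(total_ongoing_requests / target_ongoing_requests)
    return max(min_replicas, min(max_replicas, desired))


for load in (20, 60, 150, 400):
    print(load, "->", expected_replicas(load))  # 20 -> 1, 60 -> 2, 150 -> 5, 400 -> 10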

As seen in the Ray dashboard screenshot, there are warnings claiming that resources are not available, but the resource status says otherwise.

Found this output in the monitor.log file:

Active:
1 node_684be43d1c1656e01e5c57c5f2f0bcd744d354d7f6183c18722de3de
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
Usage:
0.0/2.0 CPU
0B/3.73GiB memory
0B/986.91MiB object_store_memory
Demands:
{'CPU': 0.1, 'GPU': 0.1}: 1+ pending tasks/actors
2025-06-11 06:52:17,312 INFO autoscaler.py:461 -- The autoscaler took 0.001 seconds to complete the update iteration.
2025-06-11 06:52:22,316 INFO autoscaler.py:146 -- The autoscaler took 0.0 seconds to fetch the list of non-terminated nodes.
2025-06-11 06:52:22,316 INFO autoscaler.py:418 --
======== Autoscaler status: 2025-06-11 06:52:22.316590 ========
Node status
Active:
1 node_684be43d1c1656e01e5c57c5f2f0bcd744d354d7f6183c18722de3de
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
Usage:
0.0/2.0 CPU
0B/3.73GiB memory
0B/986.91MiB object_store_memory

Should enableInTreeAutoscaling: true be set even for replica scaling inside a worker pod? Do I have to set any config on the KubeRay operator to enable this?
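For reference, my understanding from the KubeRay docs is that this flag lives under rayClusterConfig in the RayService spec; an untested sketch of where it would go (not something I have verified yet):

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: inference
spec:
  serveConfigV2: |
    # ... unchanged ...
  rayClusterConfig:
    # Enables the Ray autoscaler sidecar on the head pod, so that pending
    # resource demand (e.g. new Serve replicas) can trigger new worker pods.
    enableInTreeAutoscaling: true
    # autoscalerOptions:        # optional tuning of the autoscaler sidecar
    #   idleTimeoutSeconds: 60
    rayVersion: '2.41.0'
    # ... headGroupSpec / workerGroupSpecs unchanged ...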

The default values of both of the parameters below were too high to trigger autoscaling of replicas inside the pod during my tests; setting them to the values below fixed my issue with replica scaling:

metrics_interval_s: 2
upscale_delay_s: 2
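For completeness, this is roughly how the autoscaling_config block in my serveConfigV2 looks after the change (same values as above, shown in place; the defaults noted in the comments are what I believe they are in Ray 2.41):

autoscaling_config:
  min_replicas: 1
  max_replicas: 10
  target_ongoing_requests: 30
  metrics_interval_s: 2   # default is 10s: how often replicas report metrics to the controller
  upscale_delay_s: 2      # default is 30s: how long demand must persist before scaling up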

Worker pod scaling and node availability turned out to be a different issue altogether, related to pod availability in the cluster I think; I am still looking into it.