How to ensure Ray Serve uses the maximum number of replicas possible

From the documentation, I get this description of the arguments:

target_num_ongoing_requests_per_replica is the average number of ongoing requests per replica that the Serve autoscaler will try to ensure. Set this to a reasonable number (for example, 5) and adjust it based on your request processing length (the longer the requests, the smaller this number should be) as well as your latency objective (the shorter you want your latency to be, the smaller this number should be).

max_concurrent_queries (not in autoscaling config) is the maximum number of ongoing requests allowed for a replica. Set this to a value ~20-50% greater than target_num_ongoing_requests_per_replica. Note this is not part of the autoscaling config since it is relevant to all deployments, but it is important to set it relative to the target value if autoscaling is turned on for your deployment.

min_replicas is the minimum number of replicas for the deployment. Set this to 0 if there are long periods of no traffic and some extra tail latency during upscale is acceptable. Otherwise, set this to what you think you need for low traffic.

max_replicas is the maximum number of replicas for the deployment. Set this to ~20% higher than what you think you need for peak traffic.
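Putting those guidelines together, a deployment configuration might look something like the sketch below. All names and numbers here are illustrative assumptions, not values from the documentation:

```yaml
deployments:
- name: my_model               # hypothetical deployment name
  max_concurrent_queries: 6    # ~20% above the autoscaling target, per the guideline
  autoscaling_config:
    min_replicas: 1            # keep one replica warm if zero-traffic gaps are rare
    max_replicas: 12           # ~20% above what you expect to need at peak
    target_num_ongoing_requests_per_replica: 5
```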

Now, here is my current config:

proxy_location: EveryNode
http_options:
  host: 0.0.0.0
  port: 9000
applications:
- name: app1
  route_prefix: /doc
  import_path: inf_app:app
  runtime_env: {}
  deployments:
  - name: classifier
    max_concurrent_queries: 1
    autoscaling_config:
      min_replicas: 0
      initial_replicas: 2
      max_replicas: 6
      target_num_ongoing_requests_per_replica: 1
      metrics_interval_s: 10.0
      look_back_period_s: 30.0
      smoothing_factor: 1.0
      upscale_smoothing_factor: null
      downscale_smoothing_factor: null
      downscale_delay_s: 60.0
      upscale_delay_s: 0.00001
    ray_actor_options:
      num_cpus: 8.0
      num_gpus: 1.0

Whenever I send bulk requests, even 100 at once, it only uses 3 replicas.

Here are my Ray logs, where the same PID handles every request:

(ServeReplica:app1:document_classifier pid=91437) INFO 2023-10-19 10:41:45,865 document_classifier app1#document_classifier#EKWXSq 641480ea-0e38-4734-a505-8954c08ffda4 /doc_cls app1 replica.py:749 - __CALL__ OK 9467.2ms
(ServeReplica:app1:document_classifier pid=91437) /home/ubuntu/.local/lib/python3.10/site-packages/transformers/modeling_utils.py:909: FutureWarning: The `device` argument is deprecated and will be removed in v5 of Transformers.
(ServeReplica:app1:document_classifier pid=91437)   warnings.warn(
(ServeReplica:app1:document_classifier pid=91437) INFO 2023-10-19 10:41:48,961 document_classifier app1#document_classifier#EKWXSq d5d08148-d85d-4852-9a65-784251fc4eed /doc_cls app1 replica.py:749 - __CALL__ OK 2561.3ms
(ServeReplica:app1:document_classifier pid=91437) INFO 2023-10-19 10:41:50,186 document_classifier app1#document_classifier#EKWXSq cd40c8de-13e2-4185-a456-5de7c2ecb67a /doc_cls app1 replica.py:749 - __CALL__ OK 759.6ms
(ServeController pid=91373) WARNING 2023-10-19 10:41:50,558 controller 91373 deployment_state.py:1987 - Deployment 'document_classifier' in application 'app1' has 4 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: {"CPU": 1.0, "GPU": 1.0}, total resources available: {"CPU": 7.0}. Use `ray status` for more details.
(ServeReplica:app1:document_classifier pid=91437) INFO 2023-10-19 10:41:54,859 document_classifier app1#document_classifier#EKWXSq 0b342233-fe81-4f7d-a9ff-a99e2434b7be /doc_cls app1 replica.py:749 - __CALL__ OK 4410.0ms
(ServeReplica:app1:document_classifier pid=91437) INFO 2023-10-19 10:42:06,314 document_classifier app1#document_classifier#EKWXSq 97ff92a3-ae0f-4ebb-9288-4b5d1546c050 /doc_cls app1 replica.py:749 - __CALL__ OK 10835.0ms
(ServeReplica:app1:document_classifier pid=91437) INFO 2023-10-19 10:42:10,509 document_classifier app1#document_classifier#EKWXSq 671ba79d-87f6-4188-a3bc-dccce0a7e01a /doc_cls app1 replica.py:749 - __CALL__ OK 3975.3ms
(ServeReplica:app1:document_classifier pid=91437) INFO 2023-10-19 10:42:13,441 document_classifier app1#document_classifier#EKWXSq fa7589ae-6f46-44e4-8dce-f46e3a8e2dbf /doc_cls app1 replica.py:749 - __CALL__ OK 2872.7ms
(ServeReplica:app1:document_classifier pid=91437) INFO 2023-10-19 10:42:14,114 document_classifier app1#document_classifier#EKWXSq 0fa85af4-824e-4110-aa26-1a3624c15a28 /doc_cls app1 replica.py:749 - __CALL__ OK 510.9ms

Hi, could you try setting a higher max_concurrent_queries? You need a larger value for max_concurrent_queries than for target_num_ongoing_requests_per_replica; otherwise the deployment will not scale up correctly. You could try a value like 3 or 5 first.
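As a sketch, applying that suggestion to the config above might look like this (the value 3 is just the reply's starting suggestion, not a tuned number):

```yaml
deployments:
- name: classifier
  # Raised from 1 so each replica can hold more in-flight requests than
  # the autoscaling target, letting the autoscaler see queue pressure.
  max_concurrent_queries: 3
  autoscaling_config:
    min_replicas: 0
    initial_replicas: 2
    max_replicas: 6
    target_num_ongoing_requests_per_replica: 1
```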

Thanks! Let me check and update you.