Ray Serve is executing requests sequentially instead of in parallel, even after configuring autoscaling

Here is the scaling configuration I am using

from typing import Dict

from ray import serve
from ray.serve import Application

@serve.deployment(ray_actor_options={"num_gpus": 1},
    max_concurrent_queries=5,
    autoscaling_config={
        "target_num_ongoing_requests_per_replica": 1,
        "min_replicas": 0,
        "initial_replicas": 0,
        "max_replicas": 5,
    })
def app_builder(args: Dict[str, str]) -> Application:
    # Inference is my deployment class, defined elsewhere in this module.
    return Inference.bind(args["modelpath"])

Here is the command I am running in bash:

serve run inf_file:app_builder modelpath=/home/models/layout/

Could you post a reproduction of the Inference deployment?

Could you use serve status to confirm that multiple replicas are running at once?

So I checked the status. I am allowing 5 max replicas, as you can see in my configuration, but only 3 are working at a time. In general I am hitting it with 40 requests at the same time, and each inference takes about 2 seconds.

Can you suggest something that would make all of the max replicas get used at the same time?

To clarify:

  • The application receives 40 QPS
  • Each inference takes 2 seconds
  • target_num_ongoing_requests_per_replica=1

With this configuration, I’d expect all 5 replicas to be running since target_num_ongoing_requests_per_replica is 1. Does your application consistently receive 40 QPS, or does it only receive that in short bursts?
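As a rough mental model (this is a simplified sketch, not the actual implementation, and the real autoscaler averages its metrics over a look-back window), the Serve autoscaler sizes the deployment roughly like this:

import math

# Simplified sketch of the autoscaling decision, assuming a steady 40 in-flight requests.
ongoing_requests_total = 40
target_per_replica = 1      # target_num_ongoing_requests_per_replica
min_replicas, max_replicas = 0, 5

desired = math.ceil(ongoing_requests_total / target_per_replica)  # 40
desired = max(min_replicas, min(desired, max_replicas))           # clamped to 5
print(desired)  # 5 -> the deployment should sit at max_replicas under sustained load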

You could try reducing upscale_delay_s so the application scales up more quickly.
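For example, something like the dict below (the delay values are only illustrative; if I recall correctly, upscale_delay_s defaults to 30 seconds):

# Illustrative only: pass this as autoscaling_config in your @serve.deployment(...).
autoscaling_config = {
    "target_num_ongoing_requests_per_replica": 1,
    "min_replicas": 0,
    "max_replicas": 5,
    "upscale_delay_s": 1,     # react to bursts faster than the default
    "downscale_delay_s": 60,  # scale down more slowly to avoid thrashing
}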


Since I am using joblib to fire all 40 at the same time, I don’t think they arrive at exactly the same time; the timing differs, and maybe that is causing the issue. I will experiment with upscale_delay_s and downscale_delay_s and let you know.

So, I checked as you suggested and played around with upscale_delay_s. It seems like Ray always keeps fewer replicas than the maximum I set in the config. For example, when I set max_replicas and max_concurrent_queries to 8, I saw 4 replicas and 4 requests executing at the same time, but that’s not the case when I set them to 5.
Here is my current config:

@serve.deployment(ray_actor_options={"num_gpus": 1},
    max_concurrent_queries=5,
    autoscaling_config={
        "target_num_ongoing_requests_per_replica": 1,
        "min_replicas": 0,
        "initial_replicas": 2,
        "max_replicas": 5,
        "upscale_delay_s": 0.1,
        "downscale_delay_s": 10
    })

Still, even though I allow up to 5 replicas, it only creates 3. Am I doing something wrong?

For hitting the API, I am using this code:

import asyncio
import datetime

import aiohttp

async def send_req(image):
    image_path = f"storage/{image}"
    async with aiohttp.ClientSession() as session:
        response = await session.get("URL of API", data=image_path)
        result = await response.text()
        now = datetime.datetime.now()
        print(f"Result for {image}: {result}", now.minute, now.second)

async def main():
    images = ["image1.jpg", "image2.jpg", "image3.jpg", "image4.jpg", "image5.jpg",
              "image6.jpg", "image7.jpg", "image8.jpg", "image9.jpg", "image10.jpg"]
    # Fire all requests concurrently.
    tasks = [send_req(i) for i in images]
    await asyncio.gather(*tasks)

asyncio.run(main())

@shrekris, here are my Ray Serve logs; the execution issue is still there.

(ServeReplica:app1:document_classifier pid=91437) INFO 2023-10-19 10:41:45,865 document_classifier app1#document_classifier#EKWXSq 641480ea-0e38-4734-a505-8954c08ffda4 /doc_cls app1 replica.py:749 - __CALL__ OK 9467.2ms
(ServeReplica:app1:document_classifier pid=91437) /home/ubuntu/.local/lib/python3.10/site-packages/transformers/modeling_utils.py:909: FutureWarning: The `device` argument is deprecated and will be removed in v5 of Transformers.
(ServeReplica:app1:document_classifier pid=91437)   warnings.warn(
(ServeReplica:app1:document_classifier pid=91437) INFO 2023-10-19 10:41:48,961 document_classifier app1#document_classifier#EKWXSq d5d08148-d85d-4852-9a65-784251fc4eed /doc_cls app1 replica.py:749 - __CALL__ OK 2561.3ms
(ServeReplica:app1:document_classifier pid=91437) /home/ubuntu/.local/lib/python3.10/site-packages/transformers/modeling_utils.py:909: FutureWarning: The `device` argument is deprecated and will be removed in v5 of Transformers.
(ServeReplica:app1:document_classifier pid=91437)   warnings.warn(
(ServeReplica:app1:document_classifier pid=91437) INFO 2023-10-19 10:41:50,186 document_classifier app1#document_classifier#EKWXSq cd40c8de-13e2-4185-a456-5de7c2ecb67a /doc_cls app1 replica.py:749 - __CALL__ OK 759.6ms
(ServeController pid=91373) WARNING 2023-10-19 10:41:50,558 controller 91373 deployment_state.py:1987 - Deployment 'document_classifier' in application 'app1' has 4 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: {"CPU": 1.0, "GPU": 1.0}, total resources available: {"CPU": 7.0}. Use `ray status` for more details.
(ServeReplica:app1:document_classifier pid=91437) /home/ubuntu/.local/lib/python3.10/site-packages/transformers/modeling_utils.py:909: FutureWarning: The `device` argument is deprecated and will be removed in v5 of Transformers.
(ServeReplica:app1:document_classifier pid=91437)   warnings.warn(
(ServeReplica:app1:document_classifier pid=91437) INFO 2023-10-19 10:41:54,859 document_classifier app1#document_classifier#EKWXSq 0b342233-fe81-4f7d-a9ff-a99e2434b7be /doc_cls app1 replica.py:749 - __CALL__ OK 4410.0ms
(ServeReplica:app1:document_classifier pid=91437) /home/ubuntu/.local/lib/python3.10/site-packages/transformers/modeling_utils.py:909: FutureWarning: The `device` argument is deprecated and will be removed in v5 of Transformers.
(ServeReplica:app1:document_classifier pid=91437)   warnings.warn(
(ServeReplica:app1:document_classifier pid=91437) INFO 2023-10-19 10:42:06,314 document_classifier app1#document_classifier#EKWXSq 97ff92a3-ae0f-4ebb-9288-4b5d1546c050 /doc_cls app1 replica.py:749 - __CALL__ OK 10835.0ms
(ServeReplica:app1:document_classifier pid=91437) /home/ubuntu/.local/lib/python3.10/site-packages/transformers/modeling_utils.py:909: FutureWarning: The `device` argument is deprecated and will be removed in v5 of Transformers.
(ServeReplica:app1:document_classifier pid=91437)   warnings.warn(
(ServeReplica:app1:document_classifier pid=91437) INFO 2023-10-19 10:42:10,509 document_classifier app1#document_classifier#EKWXSq 671ba79d-87f6-4188-a3bc-dccce0a7e01a /doc_cls app1 replica.py:749 - __CALL__ OK 3975.3ms
(ServeReplica:app1:document_classifier pid=91437) /home/ubuntu/.local/lib/python3.10/site-packages/transformers/modeling_utils.py:909: FutureWarning: The `device` argument is deprecated and will be removed in v5 of Transformers.
(ServeReplica:app1:document_classifier pid=91437)   warnings.warn(
(ServeReplica:app1:document_classifier pid=91437) INFO 2023-10-19 10:42:13,441 document_classifier app1#document_classifier#EKWXSq fa7589ae-6f46-44e4-8dce-f46e3a8e2dbf /doc_cls app1 replica.py:749 - __CALL__ OK 2872.7ms
(ServeReplica:app1:document_classifier pid=91437) /home/ubuntu/.local/lib/python3.10/site-packages/transformers/modeling_utils.py:909: FutureWarning: The `device` argument is deprecated and will be removed in v5 of Transformers.
(ServeReplica:app1:document_classifier pid=91437)   warnings.warn(
(ServeReplica:app1:document_classifier pid=91437) INFO 2023-10-19 10:42:14,114 document_classifier app1#document_classifier#EKWXSq 0fa85af4-824e-4110-aa26-1a3624c15a28 /doc_cls app1 replica.py:749 - __CALL__ OK 510.9ms
(ServeReplica:app1:document_classifier pid=91437) /home/ubuntu/.local/lib/python3.10/site-packages/transformers/modeling_utils.py:909: FutureWarning: The `device` argument is deprecated and will be removed in v5 of Transformers.

The logs show this statement:

(ServeController pid=91373) WARNING 2023-10-19 10:41:50,558 controller 91373 deployment_state.py:1987 - Deployment 'document_classifier' in application 'app1' has 4 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: {"CPU": 1.0, "GPU": 1.0}, total resources available: {"CPU": 7.0}. Use `ray status` for more details.

It sounds like your cluster doesn’t have enough GPUs to support more replicas than were already running. There are two layers of autoscaling to keep in mind when using Ray Serve: the Ray Serve autoscaler and the Ray autoscaler.

The Ray Serve autoscaler manages the number of deployment replicas. When Ray Serve decides to scale up, its autoscaler asks the Ray cluster to start new deployment replicas, which are Ray actors. Each replica requests the resources specified in the ray_actor_options. From your code example, each replica requests 1 GPU.

Ray then attempts to start the replicas. If there are not enough resources to start all of them, the Ray autoscaler attempts to start new Ray nodes with the requested resources so it can place the remaining replicas.

The log above says that Ray Serve has waited at least 30 seconds for 4 new replicas to start, but they aren’t starting because there aren’t any GPUs left in the cluster. Could you provision a larger Ray cluster, or request fewer GPU resources per replica?
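One way to double-check this, besides running ray status on the command line, is to look at the cluster's resources from Python. This is just a quick diagnostic sketch:

import ray

# Connect to the already-running cluster (assumes it was started with "ray start --head" or similar).
ray.init(address="auto")

# Total resources registered with the cluster vs. what is currently free.
print(ray.cluster_resources())     # should include a "GPU" entry if any GPUs are registered
print(ray.available_resources())   # what is left after existing replicas reserved their share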


@shrekris I have one Tesla T4, and overall the API takes around 2.5 GB, so I still have about 13 GB free on the GPU. In that case, shouldn't more replicas be able to start?

Can I set num_gpus to a number less than 1, for example 0.2, so all 5 replicas can share the GPU properly?

Can I set num_gpus to a number less than 1, for example 0.2, so all 5 replicas can share the GPU properly?

Yes 🙂

Ray Serve (and Ray) support fractional GPUs, so you can set it to 0.2. That'll allow up to 5 replicas to be scheduled on one node with a single GPU. Keep in mind that there's no physical isolation in this case: all the replicas can see all the resources on that node.
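For example, something along these lines (a minimal sketch; the class body is just a placeholder for your own Inference deployment, and the model path is the one from your serve run command):

from ray import serve

@serve.deployment(
    # Each replica reserves 1/5 of a GPU, so 5 replicas fit on a single-GPU node.
    ray_actor_options={"num_gpus": 0.2},
    max_concurrent_queries=5,
    autoscaling_config={
        "target_num_ongoing_requests_per_replica": 1,
        "min_replicas": 0,
        "max_replicas": 5,
    },
)
class Inference:
    def __init__(self, model_path: str):
        self.model_path = model_path  # load your model here

    async def __call__(self, request):
        return "ok"  # run the real inference here

app = Inference.bind("/home/models/layout/")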


@shrekris Thanks for the solution, it works for me now. One more doubt: in the config

    ray_actor_options:
      num_cpus: 2.0
      num_gpus: 0.2

I have 8 cores in my CPU, so if I put 8 here, will all 8 cores be used by the one replica? Should I put 2 here so that 4 replicas can use 2 cores each?

Glad to hear it’s working!

I have 8 cores in my CPU, so if I put 8 here, will all 8 cores be used by the one replica? Should I put 2 here so that 4 replicas can use 2 cores each?

Sort of. These resources are "logical" resources in Ray, which means Ray doesn't actually pin replicas to specific cores. Instead, it looks for nodes with enough CPUs to run the replica and assigns the replica to those nodes. The replica can then use as many cores as it needs on that node.

For example, if you want to run 4 replicas on nodes with 8 CPUs, then yes, you should set num_cpus to 2. Ray will schedule 4 replicas onto each node, but once they're scheduled, they can use any number of cores on the node. Ray doesn't enforce any physical isolation.
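Putting the numbers from this thread together, a deployment-side sketch of that sizing might look like the following (illustrative only; the class name is a stand-in, and the same values can go under ray_actor_options in the Serve config file, as in your YAML above):

from ray import serve

@serve.deployment(
    # With 8 CPUs and 1 GPU per node, these values let Ray pack 4 replicas on a node
    # (4 x 2 = 8 logical CPUs, 4 x 0.2 = 0.8 GPU). The accounting is logical: Ray uses
    # it only for placement and does not pin replicas to specific cores.
    ray_actor_options={"num_cpus": 2, "num_gpus": 0.2},
)
class DocumentClassifier:
    async def __call__(self, request):
        return "ok"  # placeholder for the real inference code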
