Ray Serve autoscaling queue size

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes significant difficulty to completing my task, but I can work around it.

I’ve deployed a multi-modal ML model using Ray Serve with an autoscaling config. The autoscaling policies have max_concurrent_queries set to 1000. From the docs, this appears to be the per-replica queue size, holding requests that have not yet been processed or returned a response. I want to use this deployment for large-scale inference, where I can submit anywhere from 5,000 to 1,000,000 requests in an async fashion. Currently, when I try to submit 5,000 requests to the deployment (which is scaled up to 5 replicas, each with max_concurrent_queries of 1000), it stops accepting requests at around 2,000.

The head node is a p3.2xlarge with 1 GPU, and the scaling policies required 2 more GPUs, so the cluster scales up. The ray_actor_options are num_cpus: 2, num_gpus: 0.5. I’m using the latest Ray version, 1.12.1.
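The resource arithmetic behind that scale-up can be checked quickly (a sketch; the 2.5-GPU total and the rounding up to whole GPUs are derived from the numbers above, not stated in the post):

```python
gpus_per_replica = 0.5   # from ray_actor_options
num_replicas = 5
head_node_gpus = 1       # p3.2xlarge has a single GPU

gpus_needed = gpus_per_replica * num_replicas   # 2.5 GPUs in total
extra_gpus = gpus_needed - head_node_gpus       # 1.5-GPU deficit on the head node
print(gpus_needed, extra_gpus)                  # prints: 2.5 1.5
```

Since GPUs are provisioned in whole units, a 1.5-GPU deficit means 2 additional GPUs, matching the scale-up described above.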

How does the ServeHandle distribute requests among replicas?
Shouldn’t all the requests be divided equally among the 5 replica queues?
Is there an upper limit on max_concurrent_queries?
Is my understanding of max_concurrent_queries correct, i.e. that it is effectively the per-replica queue size?


The handle uses round robin to dispatch requests to the replicas. When the number of ongoing queries on a replica reaches max_concurrent_queries, the handle stops assigning queries to that replica and tries to assign them to another one, until the “busy” replica’s ongoing query count drops below max_concurrent_queries again.
Note that max_concurrent_queries counts both the queries running on a replica and the queries queued on it.
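That dispatch policy can be illustrated with a toy model (not Serve’s actual implementation; it ignores request completion and just snapshots how many queries each replica would accept under the cap):

```python
def dispatch(num_requests, num_replicas, max_concurrent_queries):
    """Toy model of ServeHandle dispatch: round-robin over replicas,
    skipping any replica whose in-flight count (running + queued) has
    hit max_concurrent_queries. Returns the per-replica assigned counts
    and the number of requests that could not be assigned at all."""
    in_flight = [0] * num_replicas
    unassigned = 0
    cursor = 0
    for _ in range(num_requests):
        assigned = False
        # Try each replica once, starting from the round-robin cursor.
        for offset in range(num_replicas):
            r = (cursor + offset) % num_replicas
            if in_flight[r] < max_concurrent_queries:
                in_flight[r] += 1
                cursor = (r + 1) % num_replicas
                assigned = True
                break
        if not assigned:
            unassigned += 1  # every replica is at its cap
    return in_flight, unassigned

# 5 replicas x 1000 max_concurrent_queries should absorb all 5000 requests.
print(dispatch(5000, 5, 1000))  # ([1000, 1000, 1000, 1000, 1000], 0)
```

Under this model the 5,000 requests from the original question fit exactly within the aggregate cap, which is why the observed stop at ~2,000 is surprising.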

So in the above case, where max_concurrent_queries is set to 1000 and I have 5 replicas, round-robin dispatch should queue up all 5,000 requests, right? One request takes roughly 10 seconds to process. I’m confused as to why it stops at about 2,000.

Here is a snippet from the config:

    ray_actor_options={"num_cpus": 1, "num_gpus": 0.5},
    "min_replicas": 5,
    "max_replicas": 5,
    "target_num_ongoing_requests_per_replica": 100,
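For context, a snippet like that would typically sit inside a deployment definition along these lines (a sketch only: the class name and handler body are placeholders, and the autoscaling keyword was exposed as the private `_autoscaling_config` in some Ray 1.x releases):

```python
from ray import serve

@serve.deployment(
    ray_actor_options={"num_cpus": 1, "num_gpus": 0.5},
    max_concurrent_queries=1000,
    _autoscaling_config={            # public `autoscaling_config` in later releases
        "min_replicas": 5,
        "max_replicas": 5,
        "target_num_ongoing_requests_per_replica": 100,
    },
)
class ModelDeployment:
    def __call__(self, request):
        ...  # run inference on the request

ModelDeployment.deploy()
```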

I noticed the same behavior when min_replicas was set to 2.

Hi @plum9 ,

With the current information, it is difficult for me to root-cause the issue.
Could you provide a simple script so that I can reproduce the issue on my end? (Either a CPU or GPU env is fine with me, as long as I can reproduce the issue.)


Thanks @Sihan_Wang, I’ll create a short script to replicate it.
In the meantime, is there an upper limit on max_concurrent_queries?

I made a mistake while coming up with the autoscaling policies. Closing this. Thanks Sihan!