Ray Serve autoscaling queue size

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes significant difficulty to completing my task, but I can work around it.

I’ve deployed a multi-modal ML model using Ray Serve with an autoscaling config. The autoscaling policies have max_concurrent_queries set to 1000. From the docs, this appears to be the per-replica queue size, holding requests that have not yet been processed or returned a response. I want to use this deployment for large-scale inference, where I can submit anywhere from 5,000 to 1,000,000 requests in an async fashion. Currently, when I try to submit 5,000 requests to the deployment (which is scaled up to 5 replicas, each with max_concurrent_queries of 1000), it stops accepting requests at around 2,000.

The head node is a p3.2xlarge with 1 GPU, and the scaling policies required 2 more GPUs, so the cluster scales up. The ray_actor_options are num_cpus: 2, num_gpus: 0.5. I’m using the latest Ray version, 1.12.1.
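The resource arithmetic behind that scale-up can be checked quickly (a sketch; the 2.5-GPU total and the rounding up to whole GPUs are derived from the numbers above, not stated in the post):

```python
gpus_per_replica = 0.5   # from ray_actor_options
num_replicas = 5
head_node_gpus = 1       # p3.2xlarge has a single GPU

gpus_needed = gpus_per_replica * num_replicas   # 2.5 GPUs in total
extra_gpus = gpus_needed - head_node_gpus       # 1.5-GPU deficit on the head node
print(gpus_needed, extra_gpus)                  # prints: 2.5 1.5
```

Since GPUs are provisioned in whole units, a 1.5-GPU deficit means 2 additional GPUs, matching the scale-up described above.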

How does the ServeHandle distribute requests among replicas?
Shouldn’t all the requests be divided equally among the 5 replica queues?
Is there an upper limit on max_concurrent_queries?
Is my understanding of max_concurrent_queries correct, i.e. that it is effectively the per-replica queue size?


The handle uses round robin to dispatch requests to the replicas. When the number of ongoing queries on a replica reaches max_concurrent_queries, the handle stops assigning queries to that replica and tries to assign them to another one, until the “busy” replica’s ongoing query count drops below max_concurrent_queries again.
Note that max_concurrent_queries counts both the queries running on a replica and the queries queued on it.
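That dispatch policy can be illustrated with a toy model (not Serve’s actual implementation; it ignores request completion and just snapshots how many queries each replica would accept under the cap):

```python
def dispatch(num_requests, num_replicas, max_concurrent_queries):
    """Toy model of ServeHandle dispatch: round-robin over replicas,
    skipping any replica whose in-flight count (running + queued) has
    hit max_concurrent_queries. Returns the per-replica assigned counts
    and the number of requests that could not be assigned at all."""
    in_flight = [0] * num_replicas
    unassigned = 0
    cursor = 0
    for _ in range(num_requests):
        assigned = False
        # Try each replica once, starting from the round-robin cursor.
        for offset in range(num_replicas):
            r = (cursor + offset) % num_replicas
            if in_flight[r] < max_concurrent_queries:
                in_flight[r] += 1
                cursor = (r + 1) % num_replicas
                assigned = True
                break
        if not assigned:
            unassigned += 1  # every replica is at its cap
    return in_flight, unassigned

# 5 replicas x 1000 max_concurrent_queries should absorb all 5000 requests.
print(dispatch(5000, 5, 1000))  # ([1000, 1000, 1000, 1000, 1000], 0)
```

Under this model the 5,000 requests from the original question fit exactly within the aggregate cap, which is why the observed stop at ~2,000 is surprising.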

So in the above case, where max_concurrent_queries is set to 1000 and I have 5 replicas, round-robin dispatch should queue up all 5,000 requests, right? One request takes roughly 10 seconds to process. I’m confused as to why it stops at about 2,000.

Here is a snippet from the config:

    ray_actor_options={"num_cpus": 1, "num_gpus": 0.5},
    "min_replicas": 5,
    "max_replicas": 5,
    "target_num_ongoing_requests_per_replica": 100,
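For context, a snippet like that would typically sit inside a deployment definition along these lines (a sketch only: the class name and handler body are placeholders, and the autoscaling keyword was exposed as the private `_autoscaling_config` in some Ray 1.x releases):

```python
from ray import serve

@serve.deployment(
    ray_actor_options={"num_cpus": 1, "num_gpus": 0.5},
    max_concurrent_queries=1000,
    _autoscaling_config={            # public `autoscaling_config` in later releases
        "min_replicas": 5,
        "max_replicas": 5,
        "target_num_ongoing_requests_per_replica": 100,
    },
)
class ModelDeployment:
    def __call__(self, request):
        ...  # run inference on the request

ModelDeployment.deploy()
```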

I noticed the same behavior when min_replicas was set to 2.

Hi @plum9 ,

With the current information, it is difficult for me to root-cause the issue.
Could you provide a simple script so that I can reproduce the issue on my end? (Either a CPU or GPU env is fine with me, as long as I can reproduce the issue.)


Thanks @Sihan_Wang, I’ll create a short script to replicate it.
In the meantime, is there an upper limit on max_concurrent_queries?

I made a mistake while coming up with the autoscaling policies. Closing this. Thanks Sihan!