Non-linear throughput when scaling Ray Serve replicas

Hi @christina,

I'm facing similar issues when scaling beyond 10 GPU worker nodes, each running 6 model replicas on an L4 GPU (6 replicas per GPU gave the expected latency and throughput initially). As concurrency increases, I'm seeing huge latency fluctuations and throughput drops.

Tuning max_ongoing_requests down to 2 recovered some performance, but once I scale concurrency above 200 I again see a huge increase in latency and a drop in throughput.
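For context, the deployment is configured roughly like this (a minimal sketch; the class name, payload handling, and fractional-GPU split are placeholders, not my exact code):

```python
from ray import serve

# Rough sketch of the current layout: 10+ L4 worker nodes, 6 replicas each.
@serve.deployment(
    num_replicas=60,                        # 10 nodes x 6 replicas per node
    max_ongoing_requests=2,                 # the value that helped below ~200 concurrency
    ray_actor_options={"num_gpus": 1 / 6},  # pack 6 replicas on one L4 GPU
)
class InferModel:
    async def __call__(self, request) -> dict:
        payload = await request.json()
        # real model inference goes here; echoing keeps the sketch runnable
        return {"result": payload}

app = InferModel.bind()
```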
Any advice on a workaround, or on a backoff mechanism, for sustaining higher-throughput load?
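To make the backoff idea concrete, here's the kind of client-side retry I have in mind (hypothetical `/infer` endpoint and retry numbers; it assumes max_queued_requests is set on the deployment so that Serve sheds excess load with HTTP 503s):

```python
import random
import time

import requests


def infer_with_backoff(url: str, payload: dict,
                       max_retries: int = 5,
                       base_delay: float = 0.1) -> dict:
    """POST with exponential backoff + jitter when the cluster is saturated."""
    for attempt in range(max_retries):
        resp = requests.post(url, json=payload, timeout=30)
        if resp.status_code != 503:  # 503 = Serve rejecting over-queued requests
            resp.raise_for_status()
            return resp.json()
        # back off 0.1s, 0.2s, 0.4s, ... plus jitter to avoid retry bursts
        time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    raise RuntimeError(f"still overloaded after {max_retries} retries")
```

Would something like this on the client, combined with max_queued_requests on the deployment, be the recommended way to smooth out load above 200 concurrency?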

For reference, here's my original issue: