Ray Serve not distributing load to all replicas equally

manickavela29 · September 19, 2025, 2:53pm

The upstream is fixed, but Ray Serve shows a huge performance drop once concurrency goes beyond 250. I currently have 10 worker GPU nodes with 6 model replicas each, so 60 model replicas serving concurrently in total. Latency is spiking and throughput is stuck; it does not improve with more replicas.
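For a rough sense of scale, here is a back-of-the-envelope sketch in plain Python (no Ray calls). The node, replica, and concurrency numbers come from the setup described above; the per-request service time and the function names are hypothetical and only there to illustrate the arithmetic, so treat this as an estimate rather than a measurement.

```python
# Back-of-the-envelope numbers for the cluster described in this post.
NODES = 10              # worker GPU nodes (from the post)
REPLICAS_PER_NODE = 6   # model replicas per node (from the post)
CONCURRENCY = 250       # client concurrency where the drop starts (from the post)

# Hypothetical value, NOT from the post: replace with a measured service time.
ASSUMED_SERVICE_TIME_S = 0.1


def total_replicas(nodes: int, replicas_per_node: int) -> int:
    """Total number of model replicas across the cluster."""
    return nodes * replicas_per_node


def in_flight_per_replica(concurrency: int, replicas: int) -> float:
    """Average in-flight requests per replica if load were spread perfectly evenly."""
    return concurrency / replicas


def throughput_ceiling(replicas: int, service_time_s: float, parallelism: int = 1) -> float:
    """Upper bound on requests/s when each replica processes `parallelism`
    requests at a time and each request takes `service_time_s` seconds."""
    return replicas * parallelism / service_time_s


replicas = total_replicas(NODES, REPLICAS_PER_NODE)
print("replicas:", replicas)
print("avg in-flight per replica:", round(in_flight_per_replica(CONCURRENCY, replicas), 2))
print("throughput ceiling (req/s):", throughput_ceiling(replicas, ASSUMED_SERVICE_TIME_S))
```

With perfectly even routing, 250 concurrent requests over 60 replicas is only about 4 in flight per replica, so if some replicas are observed with far more queued requests than that while others sit idle, the imbalance is in routing rather than in raw capacity.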

Another ref: Non-linear throughput when scaling Ray Serve replicas - #2 by christina
