Non-linear throughput when scaling Ray Serve replicas

christina · August 7, 2025, 7:56pm

Sublinear throughput when adding Ray Serve replicas has been reported before and can have several causes: Ray Serve’s request routing (power-of-two-choices), backoff mechanisms that avoid overloading replicas, and bottlenecks in data serialization/deserialization or network transfer, especially with large payloads.
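To give a feel for what power-of-two-choices routing does, here is a minimal standalone sketch of the general technique. It is not Ray Serve's actual router code; `pick_replica`, `simulate`, and the in-memory queue counters are hypothetical names for illustration only, and the simulation never drains the queues.

```python
import random


def pick_replica(queue_lengths, rng):
    """Sample two distinct replicas and return the index of the less loaded one.

    Requires at least two replicas. Ties go to the first sampled replica.
    """
    a, b = rng.sample(range(len(queue_lengths)), 2)
    return a if queue_lengths[a] <= queue_lengths[b] else b


def simulate(num_replicas, num_requests, seed=0):
    """Route requests one at a time and return the per-replica request counts."""
    rng = random.Random(seed)
    queues = [0] * num_replicas  # hypothetical in-flight counters, never drained
    for _ in range(num_requests):
        queues[pick_replica(queues, rng)] += 1
    return queues


if __name__ == "__main__":
    print(simulate(num_replicas=8, num_requests=8000))
```

Because the router only compares two sampled replicas per request, load evens out well on average, but each decision still depends on how fresh the queue-length information is, which is one place where scaling can fall short of linear.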

See the following discussions/docs: Ray Serve Performance Tuning, GitHub Issue #52609, GitHub Issue #52745.

Kind of! If your requests carry very large payloads, serialization and network transfer between the client, proxy, and replicas can become a bottleneck. That overhead grows as you add replicas, especially when they sit on different nodes, and it can cap throughput.

Do you know how big your payloads are?
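If you are not sure, a rough way to check is to serialize one representative request and look at the byte count. The sketch below uses the standard-library `pickle` as a stand-in; the bytes Ray actually moves may differ (for an HTTP deployment, the request body size is the more direct number). `payload_size_bytes` and `sample_request` are hypothetical names.

```python
import pickle


def payload_size_bytes(obj):
    """Approximate the serialized size of a request payload using pickle."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


if __name__ == "__main__":
    # Hypothetical request: swap in one of your real payloads here.
    sample_request = {"image": bytes(4 * 1024 * 1024), "params": {"top_k": 5}}
    size = payload_size_bytes(sample_request)
    print(f"approx. payload size: {size / 1024 / 1024:.2f} MiB")
```

Payloads in the multi-megabyte range are where serialization and transfer time usually start to matter relative to the model's own compute time.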
