Sublinear throughput in Ray Serve when increasing replicas has happened before and can be caused by several things. Ray Serve’s request routing (power-of-two-choices), backoff mechanisms to avoid overloading replicas, and potential bottlenecks in data serialization/deserialization or network transfer, especially with large payloads.
See the following discussions/docs: Ray Serve Performance Tuning, GitHub Issue #52609, GitHub Issue #52745.
Kind of! If your requests include very big payloads, the serialization + network transfer between the client, proxy, and replicas can become a bottleneck. This overhead increases with the number of replicas, especially if they are on different nodes, and can limit throughput.
Do you know how big your payloads are?