Non-linear throughput when scaling Ray Serve replicas

Sublinear throughput in Ray Serve when increasing replicas has happened before and can be caused by several things. Ray Serve’s request routing (power-of-two-choices), backoff mechanisms to avoid overloading replicas, and potential bottlenecks in data serialization/deserialization or network transfer, especially with large payloads.

See the following discussions/docs: Ray Serve Performance Tuning, GitHub Issue #52609, GitHub Issue #52745.

Kind of! If your requests include very big payloads, the serialization + network transfer between the client, proxy, and replicas can become a bottleneck. This overhead increases with the number of replicas, especially if they are on different nodes, and can limit throughput.

Do you know how big your payloads are?

1 Like