Ray Serve uses power-of-two-choices routing. When a ServeHandle
receives a request, it:
- Randomly chooses 2 replicas from the requested deployment
- Queries the number of requests that each replica is processing
- Sends the request to the replica that’s processing fewer requests. If both replicas are already processing max_concurrent_queries requests, then the ServeHandle picks 2 new replicas and repeats the process.
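
To make the steps above concrete, here is a minimal sketch of the selection logic in plain Python. It is not Ray Serve's actual router code: `choose_replica`, `in_flight`, and `max_attempts` are hypothetical names made up for illustration, and the retry cap is my own assumption so the loop terminates when every replica is saturated.

```python
import random

def choose_replica(in_flight, max_concurrent_queries, rng=None, max_attempts=100):
    """Pick a replica index using power-of-two-choices.

    in_flight[i] is the number of requests replica i is currently processing.
    Returns None if no replica with spare capacity was found within
    max_attempts samples (an assumption of this sketch, not Ray Serve behavior).
    """
    rng = rng or random.Random()
    replica_ids = range(len(in_flight))
    for _ in range(max_attempts):
        # Step 1: randomly choose 2 replicas (or 1 if only one exists).
        candidates = rng.sample(replica_ids, k=min(2, len(in_flight)))
        # Step 2: look at how many requests each candidate is processing,
        # keeping only candidates below the concurrency limit.
        available = [r for r in candidates if in_flight[r] < max_concurrent_queries]
        # Step 3: send to the less-loaded candidate; if both are full, resample.
        if available:
            return min(available, key=lambda r: in_flight[r])
    return None

# Example: replica 0 is at its limit, so replica 1 is always chosen.
print(choose_replica([5, 0], max_concurrent_queries=5))
```

If one sampled replica is at its limit and the other is not, the other one is processing fewer requests, so it gets the request without resampling.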
Power-of-two-choices generally does a good job of balancing load. For example, if there’s a slow replica or a replica processing lengthy requests, power-of-two-choices naturally directs requests to other replicas, whereas round-robin keeps sending requests to that replica at the same rate, which risks overloading it.
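
If you want to see the slow-replica effect for yourself, a toy simulation along these lines may help. Everything here is an assumption for illustration: `simulate` is a hypothetical helper, each replica handles one request at a time, one request arrives per tick, and `service_ticks[i]` is how many ticks replica i needs per request. It does not model Ray Serve itself.

```python
import random

def simulate(policy, service_ticks, num_ticks=1000, seed=0):
    """Return the peak number of in-flight requests seen on each replica."""
    rng = random.Random(seed)
    n = len(service_ticks)
    in_flight = [0] * n   # requests queued or running on each replica
    progress = [0] * n    # ticks spent on the request currently running
    peak = [0] * n
    for tick in range(num_ticks):
        # One request arrives per tick and is routed by the chosen policy.
        if policy == "round_robin":
            target = tick % n
        else:  # power-of-two-choices
            a, b = rng.sample(range(n), 2)
            target = a if in_flight[a] <= in_flight[b] else b
        in_flight[target] += 1
        peak[target] = max(peak[target], in_flight[target])
        # Every busy replica works on its current request for one tick.
        for r in range(n):
            if in_flight[r] > 0:
                progress[r] += 1
                if progress[r] >= service_ticks[r]:
                    in_flight[r] -= 1
                    progress[r] = 0
    return peak

# Replica 0 is 5x slower than the other three.
service_ticks = [10, 2, 2, 2]
print("round-robin peaks:", simulate("round_robin", service_ticks))
print("power-of-two peaks:", simulate("power_of_two", service_ticks))
```

In this setup round-robin hands replica 0 a request every 4 ticks even though it only finishes one every 10, so its backlog keeps growing, while power-of-two-choices only sends it a request when it is no busier than the other sampled replica.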
The downside is that since the 2 replicas are chosen randomly, if there are few requests and few replicas, the request distribution can be somewhat more uneven than with round-robin. How much traffic do you anticipate receiving in production?
Currently, Ray Serve doesn’t provide a way to do round-robin routing to replicas. If you’re interested in it, could you file a feature request on GitHub?