[High] Why doesn't parallelism work with data preprocessing?

Ray Serve uses power-of-two-choices routing. When a ServeHandle receives a request, it:

  1. Randomly chooses 2 replicas from the requested deployment
  2. Queries the number of requests that each replica is processing
  3. Sends the request to the replica that’s processing fewer requests. If both replicas are already processing max_concurrent_queries requests, then the ServeHandle picks 2 new replicas and repeats the process.
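To make the three steps above concrete, here is a minimal pure-Python sketch of the selection loop. It is not Ray Serve's actual implementation: `pick_replica`, the `in_flight` list, and the `max_attempts` cap are hypothetical names introduced for illustration, and the real router queries replicas over the network rather than reading a local list.

```python
import random


def pick_replica(in_flight, max_concurrent_queries, rng=random, max_attempts=100):
    """Illustrative power-of-two-choices selection (hypothetical helper).

    in_flight[i] is the number of requests replica i is currently processing.
    Returns the index of the chosen replica, or None if every attempt found
    both sampled replicas at capacity.
    """
    num_candidates = min(2, len(in_flight))
    for _ in range(max_attempts):
        # Step 1: randomly choose (up to) 2 distinct replicas.
        candidates = rng.sample(range(len(in_flight)), num_candidates)
        # Step 2: look up how many requests each candidate is processing
        # and keep the less loaded one.
        chosen = min(candidates, key=lambda i: in_flight[i])
        # Step 3: use it if it has spare capacity; otherwise both candidates
        # are full, so sample again.
        if in_flight[chosen] < max_concurrent_queries:
            return chosen
    return None


# Replicas 0 and 2 are full, so only replica 1 can ever be selected.
print(pick_replica([5, 0, 5], max_concurrent_queries=5))
```

The `max_attempts` cap only exists so the sketch terminates when every replica is full; it is an assumption of this example, not a documented Ray Serve setting.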

Power-of-two-choices generally does a good job of balancing load. For example, if one replica is slow or is processing lengthy requests, power-of-two-choices naturally directs new requests to other replicas, whereas round-robin would keep sending requests to that replica, which risks overloading it.

The downside is that because the 2 replicas are chosen randomly, the request distribution can be somewhat uneven when there are only a few requests and a few replicas. How much traffic do you anticipate receiving in production?
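A rough way to see this effect is a small simulation. The sketch below assumes an extreme low-traffic case in which every request finishes before the next one arrives, so both sampled replicas always report zero in-flight requests and the tie is effectively broken at random. `simulate_low_traffic` and `round_robin_counts` are hypothetical helpers written for this example, not Ray Serve APIs.

```python
import random
from collections import Counter


def simulate_low_traffic(num_replicas, num_requests, seed=0):
    """Count requests per replica when both sampled replicas are always idle."""
    rng = random.Random(seed)
    counts = Counter({i: 0 for i in range(num_replicas)})
    for _ in range(num_requests):
        # Both candidates are idle, so the comparison is a tie; taking the
        # first sampled replica is equivalent to a random pick.
        first, _second = rng.sample(range(num_replicas), 2)
        counts[first] += 1
    return counts


def round_robin_counts(num_replicas, num_requests):
    """Count requests per replica under strict round-robin, for comparison."""
    counts = Counter({i: 0 for i in range(num_replicas)})
    for request_index in range(num_requests):
        counts[request_index % num_replicas] += 1
    return counts


print("power-of-two (idle replicas):", dict(simulate_low_traffic(3, 10)))
print("round-robin:                 ", dict(round_robin_counts(3, 10)))
```

With only a handful of requests, the random counts can differ noticeably between replicas, while round-robin stays within one request of even. As the number of requests grows, or as replicas start to hold in-flight requests, the load comparison takes over and the imbalance shrinks.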

Currently, Ray Serve doesn’t provide a way to do round-robin routing to replicas. If you’re interested in it, could you file a feature request on GitHub?
