[High] Why doesn't parallelism work with data preprocessing?

shrekris · November 30, 2023, 7:43pm

Ray Serve uses power-of-two-choices routing. When a ServeHandle receives a request, it:

1. Randomly chooses 2 replicas from the requested deployment.
2. Queries the number of requests that each replica is processing.
3. Sends the request to the replica that's processing fewer requests. If both replicas are already processing max_concurrent_queries requests, the ServeHandle picks 2 new replicas and repeats the process.
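The three steps above can be sketched as a minimal, single-process Python function. This is not Ray Serve's router code: `Replica`, `in_flight`, `choose_replica`, and `max_attempts` are hypothetical names used only for illustration, and the default cap of 4 is arbitrary. A real router would presumably back off and retry rather than give up after a fixed number of attempts; the attempt cap here is an assumption of the sketch.

```python
import random
from dataclasses import dataclass


@dataclass
class Replica:
    # Hypothetical in-process stand-in for a deployment replica.
    name: str
    in_flight: int = 0               # requests this replica is currently processing
    max_concurrent_queries: int = 4  # arbitrary per-replica concurrency cap

    def has_capacity(self) -> bool:
        return self.in_flight < self.max_concurrent_queries


def choose_replica(replicas, rng, max_attempts=100):
    """Power-of-two-choices: sample 2 replicas, pick the less loaded one."""
    for _ in range(max_attempts):
        # Step 1: randomly choose 2 distinct replicas.
        a, b = rng.sample(replicas, 2)
        # Steps 2-3: compare in-flight counts and prefer the smaller one.
        candidate = a if a.in_flight <= b.in_flight else b
        if candidate.has_capacity():
            return candidate
        # The less loaded replica is full, so both are at the cap: resample.
    raise RuntimeError("no replica with spare capacity found")


# Usage: route one request and record it on the chosen replica.
rng = random.Random(42)
replicas = [Replica(f"r{i}") for i in range(4)]
target = choose_replica(replicas, rng)
target.in_flight += 1
```

Because the less loaded of the two sampled replicas is checked first, finding it at the cap implies the other one is at the cap too, which is exactly the "both replicas are full" condition that triggers a new pair.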

Power-of-two-choices generally does a good job of balancing load. For example, if one replica is slow or is processing lengthy requests, its queue grows, so power-of-two-choices naturally steers new requests to other replicas. Round-robin, by contrast, keeps sending that replica its share of requests, which risks overloading it.
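To make the slow-replica argument concrete, here is a toy discrete-time simulation (not Ray Serve code). Three replicas each finish one request per tick, a fourth slow replica finishes one request every 10 ticks, and 2 requests arrive per tick; all of these numbers are arbitrary assumptions. The function `simulate` is hypothetical and reports the slow replica's peak backlog under each routing policy.

```python
import itertools
import random


def simulate(policy, ticks=1000, arrivals_per_tick=2, seed=0):
    """Return the peak backlog on the slow replica under the given policy."""
    rng = random.Random(seed)
    # Assumed setup: replicas 0-2 finish 1 request per tick,
    # replica 3 is slow and finishes 1 request every 10 ticks.
    backlog = [0, 0, 0, 0]
    slow = 3
    rr = itertools.cycle(range(len(backlog)))
    peak_slow = 0
    for t in range(ticks):
        # Routing step: assign this tick's arrivals one at a time.
        for _ in range(arrivals_per_tick):
            if policy == "round_robin":
                i = next(rr)
            else:  # any other value is treated as power-of-two-choices
                a, b = rng.sample(range(len(backlog)), 2)
                i = a if backlog[a] <= backlog[b] else b
            backlog[i] += 1
        peak_slow = max(peak_slow, backlog[slow])
        # Service step: fast replicas finish one request every tick.
        for i in range(len(backlog)):
            if i == slow:
                if t % 10 == 9 and backlog[i] > 0:
                    backlog[i] -= 1
            elif backlog[i] > 0:
                backlog[i] -= 1
    return peak_slow


rr_peak = simulate("round_robin")
p2c_peak = simulate("power_of_two")
print(f"round-robin peak: {rr_peak}, power-of-two-choices peak: {p2c_peak}")
```

Under round-robin the slow replica receives a quarter of the traffic (0.5 requests per tick) no matter how far behind it is, so its queue grows by roughly 0.4 requests per tick. Under power-of-two-choices it loses most comparisons once it has a backlog, so its queue stays short.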

The downside is that because the 2 replicas are chosen randomly, the request distribution can be somewhat uneven when there are few requests and few replicas. How much traffic do you anticipate receiving in production?
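One way to see the low-traffic effect: if requests arrive so sparsely that every replica is idle when each one arrives, the "fewer requests" comparison is always a tie, so the choice degenerates to a random pick. The sketch below assumes ties go to the first sampled replica (how Ray Serve actually breaks ties is not covered here); `idle_p2c_counts` and `relative_spread` are hypothetical helpers. It compares the busiest-minus-idlest spread, relative to the mean, for 20 requests versus 20,000 requests across 4 replicas. Round-robin would give exactly 5 requests per replica in the 20-request case.

```python
import random
from collections import Counter


def idle_p2c_counts(num_requests, num_replicas=4, seed=0):
    """Count requests per replica when every replica is idle at each arrival.

    All in-flight counts are 0, so the comparison always ties; this sketch
    sends the request to the first sampled replica, a uniformly random pick.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(num_requests):
        first, _second = rng.sample(range(num_replicas), 2)
        counts[first] += 1  # tie -> first sampled replica (assumption)
    return [counts[i] for i in range(num_replicas)]


def relative_spread(counts):
    """(busiest - idlest) / mean requests per replica."""
    mean = sum(counts) / len(counts)
    return (max(counts) - min(counts)) / mean


# Average over many seeds for the small case, one run for the large case.
small = [relative_spread(idle_p2c_counts(20, seed=s)) for s in range(200)]
large = relative_spread(idle_p2c_counts(20_000))
print(f"20 requests: {sum(small) / len(small):.2f}, 20,000 requests: {large:.3f}")
```

The relative spread is large for a handful of requests and shrinks as traffic grows, which matches the point that the imbalance mainly shows up at low request volumes.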

Currently, Ray Serve doesn’t provide a way to do round-robin routing to replicas. If you’re interested in it, could you file a feature request on GitHub?
