We are deploying an inference cluster using Serve on AWS and are starting to distribute requests among the workers' own http_proxy servers. Requests are routed across all workers and the head node by our own external AWS load balancer. This load balancer is fairly 'dumb' and has no knowledge of the number of ongoing requests on each worker node. When a request arrives at a worker node's http_proxy, is it routed onward to another worker? How do all of these http_proxy servers know which workers to route to?
Short answer: the http_proxy on the worker the LB routed to will pick a replica to route the request to. The replica can be on any worker or the head node, as long as it still accepts requests.
http_location tells Ray where to start the http_proxy. If it's set to all, then every node, head or worker, can accept requests on its HTTP port. Once a request hits a node, that node's http_proxy knows which replicas are running and where. By default it uses a power-of-two-choices algorithm: pick two replicas at random and send the traffic to the one with the shorter request queue. You can read more here: https://github.com/ray-project/ray/blob/master/python/ray/serve/_private/router.py#L263
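To make the setup concrete, here is a hedged sketch of starting Serve with a proxy on every node so an external load balancer can target any node's HTTP port. The option name and value (`http_options={"location": "EveryNode"}`) are an assumption based on the http_location setting described above; check the docs for your Ray version.

```python
# Sketch only: start Serve with an http_proxy on every node.
# Assumes serve.start accepts an http_options dict with a "location"
# key whose "EveryNode" value maps to the http_location=all behavior
# described above.
from ray import serve

serve.start(http_options={"location": "EveryNode", "port": 8000})
```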
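The power-of-two-choices policy linked above can be sketched in a few lines. This is a simplified illustration, not Ray's actual router code: the function name and the plain dict of queue lengths are hypothetical stand-ins.

```python
import random

def power_of_two_choices(replicas, queue_lens):
    """Illustrative sketch of power-of-two-choices routing: sample two
    replicas uniformly at random, then send the request to whichever of
    the two has the shorter queue. Not Ray's actual implementation."""
    a, b = random.sample(replicas, 2)
    return a if queue_lens[a] <= queue_lens[b] else b

# With two replicas, the less-loaded one always wins the comparison.
queues = {"r1": 4, "r2": 0}
assert power_of_two_choices(list(queues), queues) == "r2"
```

The appeal of this policy is that it avoids a full scan of every replica's queue on each request, yet it dramatically outperforms purely random assignment, since the more loaded replica loses every pairwise comparison it appears in.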