It looks like all traffic is currently routed to the first replica while the second sits idle all the time, even though the first replica could use some relief: it is under heavy fire at >90% usage.
Proxy actors forward requests to deployment replicas using a `ServeHandle`. Usually that means replicas get roughly even load because the `ServeHandle` performs power-of-two-choices.
As an optimization, the `ServeHandle`s in the proxy actor first perform power-of-two-choices across replicas only on the same node. That way, requests can be fulfilled without requiring cross-node communication. If no replicas are on the same node (or if all replicas on the same node are busy), then the proxy falls back to replicas on other nodes.
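To make the routing behavior described above concrete, here is a minimal pure-Python simulation. It is not Serve's actual implementation: the `Replica` class, `choose_replica`, and the node names are hypothetical, and `max_ongoing` is only a stand-in for `max_ongoing_requests`. It just illustrates the idea of preferring same-node replicas and falling back to other nodes once the local ones are busy.

```python
import random
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    node: str
    ongoing: int = 0      # requests currently in flight
    max_ongoing: int = 5  # stand-in for max_ongoing_requests

    def has_capacity(self) -> bool:
        return self.ongoing < self.max_ongoing


def power_of_two(candidates, rng):
    """Sample up to two candidates at random and return the less loaded one."""
    sampled = rng.sample(candidates, k=min(2, len(candidates)))
    return min(sampled, key=lambda r: r.ongoing)


def choose_replica(replicas, proxy_node, rng):
    """Prefer same-node replicas with capacity; otherwise fall back to other nodes."""
    local = [r for r in replicas if r.node == proxy_node and r.has_capacity()]
    if local:
        return power_of_two(local, rng)
    remote = [r for r in replicas if r.node != proxy_node and r.has_capacity()]
    if remote:
        return power_of_two(remote, rng)
    return None  # every replica is saturated


rng = random.Random(0)
replicas = [Replica("replica-1", "node-a"), Replica("replica-2", "node-b")]

# The proxy lives on node-a. As long as replica-1 is below its limit,
# it is always chosen and replica-2 stays idle.
replicas[0].ongoing = 3
picked = choose_replica(replicas, "node-a", rng)
print(picked.name)  # replica-1

# Only once replica-1 reaches its limit does traffic spill over to node-b.
replicas[0].ongoing = 5
spilled = choose_replica(replicas, "node-a", rng)
print(spilled.name)  # replica-2
```

With a single proxy on `node-a`, the second replica only sees traffic once the first one reaches its limit, which matches the symptom described in the question.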
Since you’re running with a proxy only on one node, this is likely why only one of your replicas is getting traffic. It’s not being saturated (i.e., the number of ongoing requests is always lower than `max_ongoing_requests`), so the proxy keeps sending that replica traffic.
Could you enable proxy actors on all nodes, and balance requests across them? That way, the traffic is spread out more evenly.
Must I implement load balancing on my own, e.g., using nginx, so that the second replica is also utilized?
There are two places where load balancing should happen:
- Across proxy actors: this must be implemented outside of Serve. For example, you could use `nginx` here to balance requests across all the different proxy actors.
- Across deployment replicas: this happens out-of-the-box in the `ServeHandle`. When a `ServeHandle` receives a request, it selects a replica using power-of-two-choices and sends the request to that replica using a Ray actor call.
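For the first kind of balancing (across proxy actors), an external load balancer such as nginx is the usual choice. As a rough illustration of the idea only, the sketch below rotates requests across proxy endpoints on the client side. The URLs are hypothetical placeholders, and it assumes one proxy per node listening on port 8000; adjust both to your cluster.

```python
from itertools import cycle

# Hypothetical proxy endpoints, one per node (assumed port 8000).
PROXY_URLS = ["http://node-a:8000", "http://node-b:8000"]


def round_robin(urls):
    """Return an endless iterator that yields the proxy URLs in rotation."""
    return cycle(urls)


targets = round_robin(PROXY_URLS)

# Each outgoing request takes the next proxy in the rotation.
assignments = [next(targets) for _ in range(4)]
print(assignments)
# ['http://node-a:8000', 'http://node-b:8000', 'http://node-a:8000', 'http://node-b:8000']
```

Because each proxy prefers replicas on its own node, spreading requests across the proxies this way (or with nginx) is what brings the second replica into use.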