How severely does this issue affect your experience of using Ray?
High: It blocks me from completing my task.
Hi there,
I have a Ray cluster consisting of 5 nodes with 10 GPUs in total.
Using Ray serve and vLLM I have deployed multiple LLM models on this cluster and each LLM exposes a REST endpoint via FastAPI.
Currently, I have set proxy_location: HeadOnly.
I want to ensure high availability and robustness to varying workload.
I have already read that I should probably run a proxy on each node in the cluster (Architecture — Ray 2.34.0) and enable autoscaling.
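A Serve config along those lines might look like the following sketch (the application and deployment names are placeholders, and the autoscaling values are illustrative, not recommendations):

```yaml
# Run a Serve proxy on every node instead of only the head node.
proxy_location: EveryNode

applications:
  - name: llm_app               # placeholder name
    import_path: app:llm_app    # placeholder import path
    deployments:
      - name: LLMDeployment     # placeholder name
        autoscaling_config:
          min_replicas: 2
          max_replicas: 4
          target_ongoing_requests: 2
```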
For now, this is not my concern; for testing purposes I manually spawned a new node to deploy one of my LLMs with 2 replicas.
This worked so far, BUT my question is whether this includes load balancing, because it looks like all traffic is currently routed to the first replica and the second is idle all the time, even though the first replica could use some relief: it is under heavy fire at >90% utilization.
Am I just unlucky with my observations, or must I implement load balancing on my own, e.g., using nginx, so that the second replica is also utilized?
it looks like all traffic is currently routed to the first replica and the second is idle all the time, even though the first replica could use some relief: it is under heavy fire at >90% utilization.
Proxy actors forward requests to deployment replicas using a ServeHandle. Usually that means replicas get roughly even load because the ServeHandle performs power-of-two-choices.
As an optimization, the ServeHandles in the proxy actor first perform power-of-two-choices across replicas only on the same node. That way, requests can be fulfilled without requiring cross-node communication. If no replicas are on the same node (or if all replicas on the same node are busy), then the proxy falls back to replicas on other nodes.
Since you’re running with a proxy only on one node, this is likely why only one of your replicas is getting traffic. It’s not being saturated (i.e. the number of ongoing requests is always lower than max_ongoing_requests), so the proxy keeps sending that replica traffic.
Could you enable proxy actors on all nodes, and balance requests across them? That way, the traffic is spread out more evenly.
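One way to balance across the proxies would be an external reverse proxy in front of the cluster. A minimal nginx sketch that round-robins across the Serve proxy on each node (the addresses and ports are placeholders, assuming Serve's default HTTP port 8000):

```nginx
# Hypothetical upstream listing the Serve proxy running on each node.
upstream serve_proxies {
    server 10.0.0.1:8000;
    server 10.0.0.2:8000;
    server 10.0.0.3:8000;
}

server {
    listen 80;
    location / {
        # nginx round-robins across the upstream servers by default.
        proxy_pass http://serve_proxies;
    }
}
```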
must I implement load balancing on my own, e.g., using nginx, so that the second replica is also utilized?
There are two places where load balancing should happen:
1. Across proxy actors: this must be implemented outside of Serve. For example, you could use nginx here to balance requests across all the different proxy actors.
2. Across deployment replicas: this happens out-of-the-box in the ServeHandle. When a ServeHandle receives a request, it selects a replica using power-of-two-choices and sends the request to that replica using a Ray actor call.
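The locality-aware power-of-two-choices routing described above can be sketched as a small simulation (a toy model, not Serve's actual implementation; the replica names, node names, and request loop are made up):

```python
import random

MAX_ONGOING_REQUESTS = 5  # per-replica cap, as in the Serve setting

class Replica:
    def __init__(self, name, node):
        self.name = name
        self.node = node
        self.ongoing = 0  # requests currently in flight on this replica

def choose_replica(replicas, proxy_node):
    """Prefer replicas on the proxy's own node; fall back to remote
    replicas only when every local replica is saturated."""
    local = [r for r in replicas
             if r.node == proxy_node and r.ongoing < MAX_ONGOING_REQUESTS]
    candidates = local or [r for r in replicas
                           if r.ongoing < MAX_ONGOING_REQUESTS]
    if not candidates:
        return None  # all replicas saturated; a real handle would queue
    # Power-of-two-choices: sample two candidates (or one, if only one
    # is available) and take the one with fewer ongoing requests.
    sampled = random.sample(candidates, min(2, len(candidates)))
    return min(sampled, key=lambda r: r.ongoing)

# Two replicas on different nodes; the proxy runs on the head node.
replicas = [Replica("replica-1", "head"), Replica("replica-2", "worker")]
for _ in range(8):
    chosen = choose_replica(replicas, proxy_node="head")
    chosen.ongoing += 1

# The local replica absorbs traffic until it hits the cap, then the
# proxy spills over to the remote replica -- matching the thread.
print([(r.name, r.ongoing) for r in replicas])
# → [('replica-1', 5), ('replica-2', 3)]
```

With a large cap like 100, the local replica never saturates in this model, so the remote replica stays idle, which is exactly the behavior reported above.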
As an optimization, the ServeHandles in the proxy actor first perform power-of-two-choices across replicas only on the same node. That way, requests can be fulfilled without requiring cross-node communication. If no replicas are on the same node (or if all replicas on the same node are busy), then the proxy falls back to replicas on other nodes.
Ok, to be precise, my 2 replicas run on different nodes.
Since you’re running with a proxy only on one node, this is likely why only one of your replicas is getting traffic. It’s not being saturated (i.e. the number of ongoing requests is always lower than max_ongoing_requests), so the proxy keeps sending that replica traffic.
Ah, I found my error: max_ongoing_requests was set to 100 by default (Ray 2.22.0).
With Ray 2.34.0, max_ongoing_requests is set to 5 by default, and now the load balancing works.
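To avoid depending on the version default, the cap can also be set explicitly per deployment in the Serve config (a sketch; the deployment name is a placeholder):

```yaml
deployments:
  - name: LLMDeployment       # placeholder name
    num_replicas: 2
    max_ongoing_requests: 5   # spill over to other replicas past this
```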
Could you enable proxy actors on all nodes, and balance requests across them? That way, the traffic is spread out more evenly.
What would be the advantage of enabling proxy actors on all nodes if proxy_location: HeadOnly also handles the load balancing?