How severely does this issue affect your experience of using Ray?
High: It blocks me from completing my task.
Hi there,
I have a Ray cluster consisting of 5 nodes with 10 GPUs in total.
Using Ray serve and vLLM I have deployed multiple LLM models on this cluster and each LLM exposes a REST endpoint via FastAPI.
Currently, I have set proxy_location: HeadOnly.
I want to ensure high availability and robustness to varying workload.
I have already read that I should probably run a proxy on each node in the cluster (Architecture — Ray 2.34.0) and enable autoscaling.
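A Serve config along those lines might look like the following sketch (the application and deployment names are placeholders, and the autoscaling values are illustrative, not recommendations):

```yaml
# Run a Serve proxy on every node instead of only the head node.
proxy_location: EveryNode

applications:
  - name: llm_app               # placeholder name
    import_path: app:llm_app    # placeholder import path
    deployments:
      - name: LLMDeployment     # placeholder name
        autoscaling_config:
          min_replicas: 2
          max_replicas: 4
          target_ongoing_requests: 2
```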
For now, this is not my concern; for testing purposes I manually spawned a new node to deploy one of my LLMs with 2 replicas.
This worked so far, BUT my question is whether this includes load balancing, because it looks like all traffic is currently routed to the first replica and the second is idle all the time, even though the first replica could use some relief: it is under heavy fire at >90% utilization.
Am I just unlucky with my observations, or must I implement load balancing on my own, e.g., using nginx, so that the second replica is also utilized?
it looks like all traffic is currently routed to the first replica and the second is idle all the time, even though the first replica could use some relief: it is under heavy fire at >90% utilization.
Proxy actors forward requests to deployment replicas using a ServeHandle. Usually that means replicas get roughly even load because the ServeHandle performs power-of-two-choices.
As an optimization, the ServeHandles in the proxy actor first perform power-of-two-choices across replicas only on the same node. That way, requests can be fulfilled without requiring cross-node communication. If no replicas are on the same node (or if all replicas on the same node are busy), then the proxy falls back to replicas on other nodes.
Since you’re running with a proxy only on one node, this is likely why only one of your replicas is getting traffic. It’s not being saturated (i.e. the number of ongoing requests is always lower than max_ongoing_requests), so the proxy keeps sending that replica traffic.
Could you enable proxy actors on all nodes, and balance requests across them? That way, the traffic is spread out more evenly.
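One way to balance across the proxies would be an external reverse proxy in front of the cluster. A minimal nginx sketch that round-robins across the Serve proxy on each node (the addresses and ports are placeholders, assuming Serve's default HTTP port 8000):

```nginx
# Hypothetical upstream listing the Serve proxy running on each node.
upstream serve_proxies {
    server 10.0.0.1:8000;
    server 10.0.0.2:8000;
    server 10.0.0.3:8000;
}

server {
    listen 80;
    location / {
        # nginx round-robins across the upstream servers by default.
        proxy_pass http://serve_proxies;
    }
}
```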
must I implement load balancing on my own, e.g., using nginx, so that the second replica is also utilized?
There are two places where load balancing should happen:
1. Across proxy actors: this must be implemented outside of Serve. For example, you could use nginx here to balance requests across all the different proxy actors.
2. Across deployment replicas: this happens out-of-the-box in the ServeHandle. When a ServeHandle receives a request, it selects a replica using power-of-two-choices and sends the request to that replica using a Ray actor call.
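The locality-aware power-of-two-choices routing described above can be sketched as a small simulation (a toy model, not Serve's actual implementation; the replica names, node names, and request loop are made up):

```python
import random

MAX_ONGOING_REQUESTS = 5  # per-replica cap, as in the Serve setting

class Replica:
    def __init__(self, name, node):
        self.name = name
        self.node = node
        self.ongoing = 0  # requests currently in flight on this replica

def choose_replica(replicas, proxy_node):
    """Prefer replicas on the proxy's own node; fall back to remote
    replicas only when every local replica is saturated."""
    local = [r for r in replicas
             if r.node == proxy_node and r.ongoing < MAX_ONGOING_REQUESTS]
    candidates = local or [r for r in replicas
                           if r.ongoing < MAX_ONGOING_REQUESTS]
    if not candidates:
        return None  # all replicas saturated; a real handle would queue
    # Power-of-two-choices: sample two candidates (or one, if only one
    # is available) and take the one with fewer ongoing requests.
    sampled = random.sample(candidates, min(2, len(candidates)))
    return min(sampled, key=lambda r: r.ongoing)

# Two replicas on different nodes; the proxy runs on the head node.
replicas = [Replica("replica-1", "head"), Replica("replica-2", "worker")]
for _ in range(8):
    chosen = choose_replica(replicas, proxy_node="head")
    chosen.ongoing += 1

# The local replica absorbs traffic until it hits the cap, then the
# proxy spills over to the remote replica -- matching the thread.
print([(r.name, r.ongoing) for r in replicas])
# → [('replica-1', 5), ('replica-2', 3)]
```

With a large cap like 100, the local replica never saturates in this model, so the remote replica stays idle, which is exactly the behavior reported above.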
As an optimization, the ServeHandles in the proxy actor first perform power-of-two-choices across replicas only on the same node. That way, requests can be fulfilled without requiring cross-node communication. If no replicas are on the same node (or if all replicas on the same node are busy), then the proxy falls back to replicas on other nodes.
Ok, to be precise, my 2 replicas run on different nodes.
Since you’re running with a proxy only on one node, this is likely why only one of your replicas is getting traffic. It’s not being saturated (i.e. the number of ongoing requests is always lower than max_ongoing_requests), so the proxy keeps sending that replica traffic.
Ah, I found my error: max_ongoing_requests was set to 100 by default (Ray 2.22.0).
With Ray 2.34.0, max_ongoing_requests is set to 5 by default, and now the load balancing works.
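To avoid depending on the version default, the cap can also be set explicitly per deployment in the Serve config (a sketch; the deployment name is a placeholder):

```yaml
deployments:
  - name: LLMDeployment       # placeholder name
    num_replicas: 2
    max_ongoing_requests: 5   # spill over to other replicas past this
```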
Could you enable proxy actors on all nodes, and balance requests across them? That way, the traffic is spread out more evenly.
What would be the advantage of enabling proxy actors on all nodes if proxy_location: HeadOnly also handles the load balancing?