Ray Serve not distributing load to all replicas equally

1. Severity of the issue: (select one)
High: Completely blocks me.

2. Environment:

  • Ray version: rayproject/ray:2.41.0
  • Python version:
  • OS: Ubuntu
  • Cloud/Infrastructure: AWS
  • Other libs/tools (if relevant):

3. What happened vs. what you expected:

  • Expected:
    I have deployed a model with 10 replicas for inference; the replicas are multiplexed with a maximum of 2 models per replica. While simulating 100 requests, I expected all 10 replicas to share the load evenly and process it efficiently. (A rough sketch of how the deployment is wired is included after this list.)
    Config:
        deployments:
          - name: static_inference
            num_replicas: 10
            max_concurrent_queries: 5
  • Actual:
    Observing the dashboard, the 10 model replicas are not well utilized: only 3 are fully utilized, while the other replicas process just a few requests from time to time. I tried reducing max_concurrent_queries from 20 to 5 and see the same behaviour.
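
For reference, a rough sketch of how the deployment is wired (the class name, model-loading code, and payload handling below are simplified placeholders rather than the actual code; `max_ongoing_requests` is the newer name for `max_concurrent_queries`):

    from ray import serve
    from starlette.requests import Request


    def load_model(model_id: str):
        # Placeholder for the real model-loading code (e.g. loading weights onto the GPU).
        return lambda payload: {"model_id": model_id, "echo": payload}


    @serve.deployment(num_replicas=10, max_ongoing_requests=5)
    class StaticInference:
        @serve.multiplexed(max_num_models_per_replica=2)
        async def get_model(self, model_id: str):
            # Called once per model ID per replica; the loaded model is cached on that replica.
            return load_model(model_id)

        async def __call__(self, request: Request):
            # Ray Serve reads the model ID from the `serve_multiplexed_model_id`
            # request header and prefers replicas that already have that model loaded.
            model_id = serve.get_multiplexed_model_id()
            model = await self.get_model(model_id)
            return model(await request.json())


    app = StaticInference.bind()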

Currently I would prefer to run without autoscaling, but if even distribution only comes with autoscaling, I would switch to it. Can anyone suggest a better approach?
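
If I do go with autoscaling, I assume the config would change roughly like this (the min/max and target values below are guesses I have not tested; `autoscaling_config` replaces the fixed `num_replicas`):

        deployments:
          - name: static_inference
            autoscaling_config:
              min_replicas: 2
              max_replicas: 10
              target_ongoing_requests: 5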

Observations:

  1. From the dashboard, this is a comparison between an active replica and a passively active one:

     Active:
     Pending tasks: 0
     Executed tasks: 1581

     Passively active:
     Pending tasks: 0
     Executed tasks: 151

  2. GPU memory utilization: only 3 replicas are active even for 200 concurrent requests. GPU memory only grows for these 3 models; the remaining replicas do not appear to load the model at all. (L40S GPUs)

Hi, could you provide some screenshots of the metrics you're seeing?

I am running a load of 300 req/sec with Locust, and only 3 to 5 replicas are actively serving while the rest are not getting utilized. Each inference inside a Ray replica takes around ~30-60 ms.

Reading around, one explanation I found was about intelligent routing for Ray replicas: the router picks the few replicas that respond fastest and keeps sending upcoming requests to them.

Additional info: the load test runs inference for only a single model with multiplexing (the next step is to run with multiple models). A sketch of the multi-model test I have in mind is below.
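
A minimal sketch of that multi-model Locust test, assuming the app is served at `/` and that model IDs like `model_0` ... `model_9` are registered on my side (both are assumptions):

    import random

    from locust import HttpUser, task


    class InferenceUser(HttpUser):
        @task
        def infer(self):
            # Rotate over several model IDs. The multiplexed router keys on the
            # `serve_multiplexed_model_id` header, so a single fixed ID keeps
            # sending traffic to the replicas that already hold that model.
            model_id = f"model_{random.randint(0, 9)}"
            self.client.post(
                "/",
                json={"input": "test payload"},
                headers={"serve_multiplexed_model_id": model_id},
            )

This would be run the usual way, e.g. `locust -f locustfile.py --host http://<serve-address>:8000`.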