Ray Serve not distributing load to all replicas equally

1. Severity of the issue: (select one)
High: Completely blocks me.

2. Environment:

  • Ray version: rayproject/ray:2.41.0
  • Python version:
  • OS: Ubuntu
  • Cloud/Infrastructure: AWS
  • Other libs/tools (if relevant):

3. What happened vs. what you expected:

  • Expected:
    I have deployed a model with 10 replicas for inference; the replicas are multiplexed with a maximum of 2 models per replica. While simulating 100 requests, I expected all 10 replicas to share the load evenly and process it efficiently. (A rough sketch of how the deployment is wired is included after this list.)
    Config:
        deployments:
          - name: static_inference
            num_replicas: 10
            max_concurrent_queries: 5
  • Actual:
    Observing the dashboard, the 10 model replicas are not well utilized: only 3 are fully utilized, while the other replicas process just a few requests from time to time. I tried reducing max_concurrent_queries from 20 to 5 and see the same behaviour.
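
For reference, a rough sketch of how the deployment is wired (the class name, model-loading code, and payload handling below are simplified placeholders rather than the actual code; `max_ongoing_requests` is the newer name for `max_concurrent_queries`):

    from ray import serve
    from starlette.requests import Request


    def load_model(model_id: str):
        # Placeholder for the real model-loading code (e.g. loading weights onto the GPU).
        return lambda payload: {"model_id": model_id, "echo": payload}


    @serve.deployment(num_replicas=10, max_ongoing_requests=5)
    class StaticInference:
        @serve.multiplexed(max_num_models_per_replica=2)
        async def get_model(self, model_id: str):
            # Called once per model ID per replica; the loaded model is cached on that replica.
            return load_model(model_id)

        async def __call__(self, request: Request):
            # Ray Serve reads the model ID from the `serve_multiplexed_model_id`
            # request header and prefers replicas that already have that model loaded.
            model_id = serve.get_multiplexed_model_id()
            model = await self.get_model(model_id)
            return model(await request.json())


    app = StaticInference.bind()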

Currently I would prefer to run without autoscaling, but if even distribution only comes with autoscaling, I would switch to it. Can anyone suggest a better approach?
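
If I do go with autoscaling, I assume the config would change roughly like this (the min/max and target values below are guesses I have not tested; `autoscaling_config` replaces the fixed `num_replicas`):

        deployments:
          - name: static_inference
            autoscaling_config:
              min_replicas: 2
              max_replicas: 10
              target_ongoing_requests: 5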

Observations:

  1. From the dashboard, this is a comparison between an active replica and a passively active one:

     Active:
     Pending tasks: 0
     Executed tasks: 1581

     Passively active:
     Pending tasks: 0
     Executed tasks: 151

  2. GPU memory utilization: only 3 replicas are active even for 200 concurrent requests. GPU memory only grows for these 3 models; the remaining replicas do not appear to load the model at all. (L40S GPUs)

Hi, could you provide some screenshots of the metrics you're seeing?

I am running a load of 300 req/sec with Locust, and only 3 to 5 replicas are actively serving while the rest are not getting utilized. Each inference inside a Ray replica takes around ~30-60 ms.

Reading around, one explanation I found was about intelligent routing for Ray replicas: the router picks the few replicas that respond fastest and keeps sending upcoming requests to them.

Additional info: the load test runs inference for only a single model with multiplexing (the next step is to run with multiple models). A sketch of the multi-model test I have in mind is below.
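
A minimal sketch of that multi-model Locust test, assuming the app is served at `/` and that model IDs like `model_0` ... `model_9` are registered on my side (both are assumptions):

    import random

    from locust import HttpUser, task


    class InferenceUser(HttpUser):
        @task
        def infer(self):
            # Rotate over several model IDs. The multiplexed router keys on the
            # `serve_multiplexed_model_id` header, so a single fixed ID keeps
            # sending traffic to the replicas that already hold that model.
            model_id = f"model_{random.randint(0, 9)}"
            self.client.post(
                "/",
                json={"input": "test payload"},
                headers={"serve_multiplexed_model_id": model_id},
            )

This would be run the usual way, e.g. `locust -f locustfile.py --host http://<serve-address>:8000`.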