1. Severity of the issue: (select one)
High: Completely blocks me.
2. Environment:
- Ray version: rayproject/ray:2.41.0
- Python version:
- OS: Ubuntu
- Cloud/Infrastructure: AWS
- Other libs/tools (if relevant):
3. What happened vs. what you expected:
- Expected:
I have deployed a model with 10 replicas for inference; the replicas are multiplexed with a maximum of 2 models per replica. While simulating 100 concurrent requests, I expected all 10 replicas to share the load evenly and process it efficiently.
Config:

```yaml
deployments:
  - name: static_inference
    num_replicas: 10
    max_concurrent_queries: 5
```
- Actual:
But observing from the dashboard, the 10 model replicas are not well utilized: only 3 are fully utilized, while the other replicas process only a few requests from time to time. I tried reducing max_concurrent_queries from 20 to 5 and see the same behaviour.
For now I would prefer to run without autoscaling, but if even distribution is only achievable with autoscaling, I would switch to it. Can anyone suggest a better approach?
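In case I do need autoscaling, this is roughly the config I would try. This is a sketch based on my reading of the Serve config schema, not a verified setup: I believe `max_ongoing_requests` is the successor of `max_concurrent_queries` in recent Ray releases, and the target value below is a guess, not a tuned number.

```yaml
deployments:
  - name: static_inference
    max_ongoing_requests: 5          # assumed successor of max_concurrent_queries
    autoscaling_config:
      min_replicas: 2
      max_replicas: 10
      target_ongoing_requests: 2     # requests per replica the autoscaler aims for (guessed value)
```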
Observations:
- From the dashboard, this is the comparison between an actively used replica and a mostly idle ("passively active") one:
  - Active replica:
    - Pending tasks: 0
    - Executed tasks: 1581
  - Mostly idle replica:
    - Pending tasks: 0
    - Executed tasks: 151
- GPU memory utilization: even with 200 concurrent requests there are only 3 active replicas; GPU memory climbs only for those 3 models, and the remaining replicas do not appear to load the model at all (L40S GPUs).
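My guess (an assumption, not verified against Serve internals) is that the multiplexed router prefers replicas that already have the requested model loaded, so when the test traffic only uses a few distinct model IDs, all requests concentrate on the few replicas that loaded those models first. A stdlib-only toy sketch of such a cache-aware routing policy reproduces the shape of what I see on the dashboard:

```python
import random
from collections import Counter, defaultdict


def route_requests(num_replicas, model_ids, num_requests,
                   max_models_per_replica=2, seed=0):
    """Toy cache-aware router: prefer replicas that already hold the model;
    on a cache miss, load the model on the least-loaded replica."""
    rng = random.Random(seed)
    loaded = defaultdict(set)   # replica index -> model IDs currently loaded
    executed = Counter()        # replica index -> requests handled
    for _ in range(num_requests):
        model = rng.choice(model_ids)
        # Replicas that already have this model loaded (cache hit).
        hits = [r for r in range(num_replicas) if model in loaded[r]]
        if hits:
            replica = min(hits, key=lambda r: executed[r])
        else:
            # Cache miss: pick the least-loaded replica, evicting an
            # arbitrary model if the replica is at capacity.
            replica = min(range(num_replicas), key=lambda r: executed[r])
            if len(loaded[replica]) >= max_models_per_replica:
                loaded[replica].pop()
            loaded[replica].add(model)
        executed[replica] += 1
    return executed


# With only 3 distinct model IDs, at most 3 of the 10 replicas ever see
# traffic under this policy -- matching the 3 "active" replicas observed.
counts = route_requests(num_replicas=10, model_ids=["a", "b", "c"],
                        num_requests=200)
print(sorted(counts.values(), reverse=True))
```

If this is roughly how the real router behaves, the uneven utilization would be expected whenever the number of distinct model IDs in flight is much smaller than `num_replicas`, regardless of `max_concurrent_queries`.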