1. Severity of the issue:
High: Completely blocks me.
2. Environment:
- Ray version: 2.43.0
- Python version: 3.12.7
- OS: Linux / macOS
- Cloud/Infrastructure: AWS
- Other libs/tools (if relevant): Locust
3. What happened vs. what you expected:
- Expected: RPS stays roughly constant with multiple concurrent users, since max_ongoing_requests is set to 1
- Actual: RPS drops as concurrent users increase
When I run this deployment and send it requests from a single locust user, I get about 9-10 RPS, as expected: the handler sleeps for 100 ms and max_ongoing_requests=1 serializes requests, so a single replica should top out at roughly 10 RPS.
When I run locust with 5 concurrent users, I expect the same throughput, since max_ongoing_requests is still 1 for this deployment. Instead, the RPS drops to about 6. We can reproduce the same drop in our production code with a different type of workload, and it is quite inexplicable. Can you please help explain / resolve this?
Deployment code:

```python
import time

from fastapi import FastAPI
from ray import serve

app = FastAPI()


@serve.deployment(max_ongoing_requests=1)
@serve.ingress(app)
class TestDeployment:
    @app.post("/invoke", name="invoke")
    def invoke(self):
        # Simulate 100 ms of work; with max_ongoing_requests=1 this
        # should cap a single replica at ~10 RPS.
        time.sleep(0.1)
        return "Hello, world!"


deployment = TestDeployment.bind()
```
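
For reference, I run the app locally via the Python API along these lines (a minimal sketch; the filename `deployment.py` is an assumption, and `serve run deployment:deployment` on the CLI behaves the same):

```python
# run_app.py - a minimal sketch, assuming the deployment code above
# is saved as deployment.py in the same directory.
import time

from ray import serve

from deployment import deployment

serve.run(deployment)  # the Serve HTTP proxy listens on http://127.0.0.1:8000 by default
while True:
    time.sleep(1)  # keep the driver alive so the app stays up
```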
Locust file:

```python
from locust import HttpUser, task


class LocustUser(HttpUser):
    @task
    def invoke(self):
        self.client.post("/invoke")
```
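
In case it helps rule out the load generator, here is a sketch of the same measurement without locust: each thread plays one "user" sending requests back-to-back (assumes the app is running at http://127.0.0.1:8000 and `requests` is installed):

```python
# measure_rps.py - a minimal sketch of the same measurement without locust.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://127.0.0.1:8000/invoke"


def worker(deadline: float) -> int:
    # Mimic one locust user: send requests back-to-back until the deadline.
    count = 0
    while time.time() < deadline:
        requests.post(URL)
        count += 1
    return count


def measure(users: int, duration: float = 30.0) -> None:
    deadline = time.time() + duration
    with ThreadPoolExecutor(max_workers=users) as pool:
        total = sum(pool.map(worker, [deadline] * users))
    print(f"{users} users: {total / duration:.1f} RPS")


if __name__ == "__main__":
    measure(1)  # ~10 RPS expected: one 100 ms request at a time
    measure(5)  # should also hold ~10 RPS if max_ongoing_requests=1 serializes
```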