QPS drop with multiple locust users

1. Severity of the issue: (select one)
High: Completely blocks me.

2. Environment:

  • Ray version: 2.43.0
  • Python version: 3.12.7
  • OS: linux / osx
  • Cloud/Infrastructure: aws
  • Other libs/tools (if relevant): locust

3. What happened vs. what you expected:

  • Expected: QPS to stay roughly constant as concurrent users increase, since max_ongoing_requests is set to 1
  • Actual: QPS drops as concurrent users increase

When I run this deployment and send requests from a single locust user, I get about 9-10 RPS, as expected: the handler sleeps for 0.1 s per request, so ~10 RPS is the ceiling.

When I run locust with 5 concurrent users, I expect the same throughput, since max_ongoing_requests is set to 1 for this deployment. Instead, the RPS drops to about 6. We can reproduce the same drop in our production code with a different kind of workload, and it is quite inexplicable. Can you please help explain or resolve this?
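To make the expectation explicit, the ceiling follows directly from the handler latency: with max_ongoing_requests=1 the replica serves requests strictly one at a time, so throughput should be bounded by the per-request service time regardless of how many clients are queued. A back-of-envelope sketch:

```python
HANDLER_LATENCY_S = 0.1  # time.sleep(0.1) in the handler below

# With max_ongoing_requests=1, requests are processed sequentially,
# so the theoretical ceiling does not depend on the number of
# concurrent locust users.
max_rps = 1 / HANDLER_LATENCY_S
print(max_rps)  # → 10.0
```

This is why ~10 RPS with one user looks right, and why ~6 RPS with five users is surprising.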

Deployment code -

import time
from fastapi import FastAPI
from ray import serve

app = FastAPI()

@serve.deployment(max_ongoing_requests=1)
@serve.ingress(app)
class TestDeployment:
    @app.post("/invoke", name="invoke")
    def invoke(self):
        time.sleep(0.1)  # simulate 100 ms of work per request
        return "Hello, world!"


deployment = TestDeployment.bind()
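For reference, one way to launch the deployment above with the Serve CLI (the filename `deployment.py` is an assumption):

```shell
# Serves the app on Serve's default HTTP port, 8000.
# Assumes the code above is saved as deployment.py.
serve run deployment:deployment
```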

Locust file -

from locust import HttpUser, task

class LocustUser(HttpUser):
    @task
    def invoke(self):
        self.client.post("/invoke")
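And the commands used to drive it, for both user counts (the locust filename and host are assumptions; the host matches Serve's default port):

```shell
# 1 user: ~9-10 RPS, matching the 0.1 s handler latency.
locust -f locustfile.py --headless -u 1 -H http://localhost:8000

# 5 users: RPS drops to ~6 instead of holding at ~10.
locust -f locustfile.py --headless -u 5 -r 5 -H http://localhost:8000
```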