Why is it not possible to send more than 100 requests in parallel to Ray Serve?

I am hosting a very simple service:

from typing import Dict

import requests
from ray import serve
from starlette.requests import Request


# Single replica that allows up to 500 in-flight requests.
@serve.deployment(num_replicas=1, max_concurrent_queries=500)
class MyModelDeployment:
    def __init__(self, msg: str):
        self._msg = msg

    def __call__(self, request: Request) -> Dict:
        return {"result": self._msg}


app = MyModelDeployment.bind(msg="Hello world!")

serve.run(app, route_prefix="/")

print(requests.get("http://localhost:8000/").json())  # it works

I use hey to test performance:

hey -n 200 -c 100 http://localhost:8000/ works, but
hey -n 200 -c 200 http://localhost:8000/ fails with:

[1] Get "http://localhost:8000/": read tcp 127.0.0.1:51044->127.0.0.1:8000: read: connection reset by peer
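For what it's worth, the failure reproduces without hey as well. Here is a minimal sketch using requests and a thread pool, assuming the service above is still running on localhost:8000 (the URL and request counts mirror the hey invocation):

import concurrent.futures

import requests

URL = "http://localhost:8000/"  # the endpoint served above

def send(_: int) -> str:
    # Each worker opens its own connection, roughly like one hey worker.
    try:
        return str(requests.get(URL, timeout=10).status_code)
    except requests.RequestException as exc:
        return f"error: {type(exc).__name__}"

# 200 total requests with up to 200 in flight, mirroring `hey -n 200 -c 200`.
with concurrent.futures.ThreadPoolExecutor(max_workers=200) as pool:
    results = list(pool.map(send, range(200)))

print({r: results.count(r) for r in set(results)})

With max_workers=100 every request returns 200; with max_workers=200 a portion of them raise connection errors.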

Why can't I send more than 100 requests in parallel to Ray Serve?

This is unexpected; the replica and the proxy should be able to support that load. Do the proxy or replica logs show any failures? I wonder if this might be a limitation of trying to open and sustain 200 connections on a single machine.
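One local limit worth ruling out is the per-process open file descriptor cap, since each TCP connection consumes a descriptor on both the client and the server side. A quick check from Python (standard library resource module, Unix only):

import resource

# A low soft limit (e.g. the macOS default of 256) can surface as
# connection resets once enough simultaneous connections are opened.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open file descriptors: soft limit={soft}, hard limit={hard}")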

I do not see any errors. It processes N requests and then breaks.

If there are no errors, then this may be a limitation of the machine's ability to open and sustain 200 connections to itself. Could you try the following experiment:

  • Run 2 client machines that run hey and 1 server machine that runs Serve.
  • Run hey with 100 clients on each client machine and send requests to the server machine.

If that succeeds, then the root cause is likely the single machine.

Can I emulate this experiment (2 client machines and 1 server) on a single machine (by treating 1 core as a machine)?
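Something like the following might approximate it: two separate OS processes, each holding 100 concurrent connections, standing in for the two client machines. Note that both "machines" still share one loopback interface and one kernel TCP stack, so this is not a full substitute for the real two-machine test (the URL and counts are taken from the experiment above):

import concurrent.futures
import multiprocessing

import requests

URL = "http://localhost:8000/"  # endpoint from the deployment above

def run_client_machine(machine_id: int) -> int:
    # One process stands in for one "client machine" running `hey -c 100`.
    def send(_: int) -> bool:
        try:
            return requests.get(URL, timeout=10).status_code == 200
        except requests.RequestException:
            return False

    with concurrent.futures.ThreadPoolExecutor(max_workers=100) as pool:
        results = list(pool.map(send, range(100)))
    print(f"machine {machine_id}: {sum(results)}/100 succeeded")
    return sum(results)

if __name__ == "__main__":
    # Two separate OS processes emulate the two client machines.
    with multiprocessing.Pool(processes=2) as pool:
        pool.map(run_client_machine, [0, 1])

If this version also drops connections, that points back at the shared machine rather than Serve itself, since the same kernel limits apply to both processes.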