Ray Serve latency overhead

I tried to measure the Ray Serve latency overhead locally by creating a backend with a do-nothing function. It turns out the overhead is about 7-10 ms, but the official docs say it should be about 1-2 ms. Any suggestions for achieving 1-2 ms latency overhead with Ray Serve? Thanks.

def backend_test(request):
    return {"test1": "test1"}

cc @simon-mo Can you address this question?

Hi @gyin-ai, how are you measuring the overhead? The 1-2 ms overhead is measured with wrk (https://github.com/wg/wrk), a modern HTTP benchmarking tool.
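
For reference, a typical invocation looks something like wrk -t4 -c64 -d30s --latency http://127.0.0.1:8000/ (the thread count, connection count, and duration here are only illustrative); the --latency flag prints the detailed latency distribution.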

Yes, I used wrk as well, and I also tried it in a k8s cluster. In the same environment, TorchServe can achieve about 1 ms. By the way, when I tried pure FastAPI, the latency was also around 7 ms.
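
For completeness, the pure FastAPI baseline was essentially a no-op endpoint like this (a sketch; the route, host, and port are just what I assumed for the comparison):

from fastapi import FastAPI
import uvicorn

app = FastAPI()

# No-op endpoint mirroring the Ray Serve backend above.
@app.get("/")
def backend_test():
    return {"test1": "test1"}

if __name__ == "__main__":
    uvicorn.run(app, host="127.0.0.1", port=8000)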
