Understanding performance of Ray Serve

I have a very simple identity method hosted by Ray Serve:

from ray import serve

@serve.deployment(num_replicas=1, ray_actor_options={"num_cpus": 1, "num_gpus": 0})
class IdentityService:
    def __init__(self):
        pass

    @serve.batch(max_batch_size=64, batch_wait_timeout_s=0.01)
    async def handle_batch(self, inputs):
        # serve.batch collects the arguments of up to 64 concurrent calls
        # (or whatever arrives within the 10 ms wait window) into one list.
        print("Our input array has length:", len(inputs))
        return inputs

    async def __call__(self, request):
        return await self.handle_batch(request)

app = IdentityService.bind()
handle = serve.run(app)

Then I emulate as many concurrent requests as possible:

import asyncio

import torch

async def send_request():
    # INPUT_SIZE is defined elsewhere in my script.
    return await handle.remote(torch.randint(low=0, high=3, size=(INPUT_SIZE,)).float())

async def main():
    tasks = []
    for _ in range(10000):
        task = asyncio.create_task(send_request())
        tasks.append(task)

    return await asyncio.gather(*tasks)

# Top-level await requires a notebook / IPython;
# in a plain script, use asyncio.run(main()).
await main()
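
For reference, this is roughly how I time the run (a minimal sketch; the QPS figure is just the request count divided by the elapsed time):

import time

start = time.perf_counter()
results = await main()  # same coroutine as above, again in a notebook
elapsed = time.perf_counter() - start
# 10000 requests in ~35 s comes out to roughly 285 QPS.
print(f"{len(results)} requests in {elapsed:.1f}s -> {len(results) / elapsed:.0f} QPS")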

No matter which configuration options I change (num_replicas, num_cpus, max_concurrent_queries), I always get the same performance: the 10k requests are processed in ~35 s, and the batch sizes stay between 5 and 8.
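
For example, this is the kind of variation I have tried (the specific values here are illustrative, not the exact ones I used):

@serve.deployment(
    num_replicas=4,                # illustrative; I varied this
    max_concurrent_queries=100,    # illustrative; I varied this too
    ray_actor_options={"num_cpus": 2, "num_gpus": 0},
)
class IdentityService:
    ...  # same body as above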

Can you please explain how to increase performance?

You’re likely running into the proxy actor as a bottleneck. We run one proxy actor per node, so I’d recommend increasing the number of nodes and seeing if that alleviates the bottleneck.
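
One way to confirm there is a single proxy is to list the actors in the cluster (this assumes Ray 2.x with the state API installed, i.e. pip install "ray[default]"; the proxy's class name differs across versions, e.g. HTTPProxyActor or ProxyActor):

# Lists every actor in the cluster. You should see one proxy actor per node,
# plus one replica actor per deployment replica.
ray list actors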

Can you please explain what you mean by a node and how to increase the number of nodes? I run this code on my local machine with 16 CPUs, and changing num_cpus does not change performance.

I mean the number of machines. Each machine runs only one proxy, so that proxy is likely the bottleneck. If you try this workload with more machines (but the same number of replicas) and the max QPS goes up, then the proxy actor was indeed the bottleneck.
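
To add machines, start Ray on each one and attach it to the same cluster; a minimal sketch (the head-node IP below is hypothetical):

# On the head machine:
ray start --head --port=6379

# On each additional machine (10.0.0.1 stands in for the head node's IP):
ray start --address='10.0.0.1:6379'

After the workers join, running serve.run from the head node can place replicas across the cluster, and each node typically gets its own proxy actor.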

How can I debug this proxy and confirm that it really is the bottleneck? My CPU load is low, and the achieved request rate is quite low too. How can I fix this bottleneck on a local machine?