Non-linear throughput when scaling Ray Serve replicas

Hi @christina,

I'm facing similar issues when scaling beyond 10 GPU worker nodes, each running 6 model replicas on an L4 GPU (6 replicas per GPU gave the expected latency and throughput initially). As concurrency increases, I'm seeing huge latency fluctuations and throughput drops.

Tuning max_ongoing_requests down to 2 recovered some performance, but once I scale concurrency above 200 I again see a huge increase in latency and a drop in throughput.
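For context, the deployment is configured roughly like this (a minimal sketch; the class name, payload handling, and fractional-GPU split are placeholders, not my exact code):

```python
from ray import serve

# Rough sketch of the current layout: 10+ L4 worker nodes, 6 replicas each.
@serve.deployment(
    num_replicas=60,                        # 10 nodes x 6 replicas per node
    max_ongoing_requests=2,                 # the value that helped below ~200 concurrency
    ray_actor_options={"num_gpus": 1 / 6},  # pack 6 replicas on one L4 GPU
)
class InferModel:
    async def __call__(self, request) -> dict:
        payload = await request.json()
        # real model inference goes here; echoing keeps the sketch runnable
        return {"result": payload}

app = InferModel.bind()
```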
Any advice on a workaround, or on a backoff mechanism, for sustaining higher-throughput load?
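To make the backoff idea concrete, here's the kind of client-side retry I have in mind (hypothetical `/infer` endpoint and retry numbers; it assumes max_queued_requests is set on the deployment so that Serve sheds excess load with HTTP 503s):

```python
import random
import time

import requests


def infer_with_backoff(url: str, payload: dict,
                       max_retries: int = 5,
                       base_delay: float = 0.1) -> dict:
    """POST with exponential backoff + jitter when the cluster is saturated."""
    for attempt in range(max_retries):
        resp = requests.post(url, json=payload, timeout=30)
        if resp.status_code != 503:  # 503 = Serve rejecting over-queued requests
            resp.raise_for_status()
            return resp.json()
        # back off 0.1s, 0.2s, 0.4s, ... plus jitter to avoid retry bursts
        time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    raise RuntimeError(f"still overloaded after {max_retries} retries")
```

Would something like this on the client, combined with max_queued_requests on the deployment, be the recommended way to smooth out load above 200 concurrency?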

For reference, here's my original issue: