Non-linear throughput when scaling Ray Serve replicas


Severity: Medium – Significantly affects my productivity but I can find a workaround.


Environment:

  • Ray version: 2.48.0

  • Python version: 3.12.11

  • OS: Ubuntu 22, no Docker

  • Infra: Ray autoscaler with AWS EC2


What I expected

I’m serving a SigLIP model (google/siglip-large-patch16-384) using Ray Serve. I expected that increasing the number of replicas would linearly (or close to linearly) increase throughput under load.

Example:
With 1 replica → expected ~5 req/s
With 2 replicas → expected ~10 req/s
With 3 replicas → expected ~15 req/s


What actually happened

Here’s the measured performance (logs below):

  • 1 replica: 4.9 req/s ✅

  • 2 replicas: 6.8 req/s ❌

  • 3 replicas: 9.7 req/s ❌

So throughput grows sub-linearly when I add replicas, which is far from what I expected.


Repro code (simplified)

I’m sending 200 async requests to the deployment:

import asyncio

import numpy as np
from ray import serve

number_of_tasks = 200  # total number of requests sent

handle = serve.get_deployment_handle(
    "MultimodalEmbeddingService", app_name="ai_multimodal_embedding_service"
)

semaphore = asyncio.Semaphore(50)  # cap on concurrent in-flight requests


async def process_task(i):
    async with semaphore:
        response = await handle.run.remote(
            {
                "data": [{"image": np.random.rand(640, 640, 3), "id": f"test_image_{i}"}],
                "params": {"ai_model_name": "google/siglip-large-patch16-384"},
            }
        )
        return i, response


async def main():
    tasks = [process_task(i) for i in range(number_of_tasks)]
    return await asyncio.gather(*tasks)


results = asyncio.run(main())

Deployment uses:

@serve.deployment(ray_actor_options={"num_gpus": 1})
class MultimodalEmbeddingService:
    ...
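For context, here is a fuller sketch of the deployment's shape (model loading and inference details omitted; I vary the replica count between runs, shown here via num_replicas just for illustration):

from ray import serve


@serve.deployment(
    ray_actor_options={"num_gpus": 1},  # one GPU per replica
    num_replicas=3,                     # changed to 1 / 2 / 3 between test runs
)
class MultimodalEmbeddingService:
    def __init__(self):
        # Load the SigLIP model onto this replica's GPU (omitted).
        ...

    async def run(self, request: dict):
        # Embed the images in request["data"] and return the results (omitted).
        ...


app = MultimodalEmbeddingService.bind()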

What I’ve checked so far

  • Each replica has its own GPU (Ray autoscaler allocated 3 GPUs / 14 CPUs) and is fully running when I launch my script

  • Tried increasing/decreasing semaphore size (concurrency)

  • Logs don’t show crashes

  • Model returns valid results

  • This is running in a multi-node EC2 cluster

  • Dashboard shows all replicas are healthy and GPU utilization seems normal


Questions

  1. What could cause sub-linear throughput when scaling the number of replicas in Ray Serve?

  2. Could it be due to scheduling or data transfer bottlenecks?

Let me know if you’d like full logs or more code.
Thanks in advance!

Sub-linear throughput in Ray Serve when increasing replicas has come up before, and it can be caused by several things: Ray Serve’s request routing (power-of-two-choices), backoff mechanisms that avoid overloading replicas, and bottlenecks in data serialization/deserialization or network transfer, especially with large payloads.

See the following discussions/docs: Ray Serve Performance Tuning, GitHub Issue #52609, GitHub Issue #52745.

Kind of! If your requests carry very large payloads, serialization plus network transfer between the client, the proxy, and the replicas can become the bottleneck. That overhead grows as requests are spread across more replicas, especially when they sit on different nodes, and it can cap throughput.

Do you know how big your payloads are?
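If you’re not sure, something like this gives a rough estimate of one request’s size (pickle is only a stand-in for the serialization Ray actually uses, so treat the numbers as ballpark):

import pickle

import numpy as np

# One request payload, built the same way as in the repro script.
payload = {
    "data": [{"image": np.random.rand(640, 640, 3), "id": "test_image_0"}],
    "params": {"ai_model_name": "google/siglip-large-patch16-384"},
}

print(f"raw image array: {payload['data'][0]['image'].nbytes / 1e6:.1f} MB")  # float64 pixels
print(f"pickled payload: {len(pickle.dumps(payload)) / 1e6:.1f} MB")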


Each payload was approximately 1 MB. I tested passing an S3 URL directly, with the replica downloading the data locally, and observed improved performance. Throughput also appears to scale almost linearly with the number of replicas.
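Roughly what the replica-side change looks like (a sketch; the bucket layout, boto3 usage, and the load_image helper name are just my illustration):

import io

import boto3
from PIL import Image


def load_image(s3_url: str) -> Image.Image:
    """Download an image referenced by an s3:// URL and decode it on the replica."""
    bucket, key = s3_url.removeprefix("s3://").split("/", 1)
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
    return Image.open(io.BytesIO(body)).convert("RGB")


# The client now sends {"image_url": "s3://...", "id": ...} instead of the raw
# numpy array, so only a short string crosses the network per request.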

I have a follow-up question: if my model can only process one request at a time (GPU fully utilized), would a configuration of target_ongoing_requests: 1 and max_ongoing_requests: 10 be appropriate? I tried setting max_ongoing_requests to 1, but it actually degraded performance.
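For clarity, this is the kind of configuration I mean (sketch only; my understanding is that target_ongoing_requests lives under autoscaling_config while max_ongoing_requests is a deployment-level option):

from ray import serve


@serve.deployment(
    ray_actor_options={"num_gpus": 1},
    max_ongoing_requests=10,           # requests a single replica will accept at once
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 3,
        "target_ongoing_requests": 1,  # autoscaler aims for ~1 in-flight request per replica
    },
)
class MultimodalEmbeddingService:
    ...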

Thank you.