Non-linear throughput when scaling Ray Serve replicas


Severity: Medium – Significantly affects my productivity but I can find a workaround.


Environment:

  • Ray version: 2.48.0

  • Python version: 3.12.11

  • OS: Ubuntu 22, no Docker

  • Infra: Ray autoscaler with AWS EC2


What I expected

I’m serving a SigLIP model (google/siglip-large-patch16-384) using Ray Serve. I expected that increasing the number of replicas would linearly (or close to linearly) increase throughput under load.

Example:
With 1 replica → expected ~5 req/s
With 2 replicas → expected ~10 req/s
With 3 replicas → expected ~15 req/s


What actually happened

Here’s the measured performance (logs below):

  • 1 replica: 4.9 req/s ✅

  • 2 replicas: 6.8 req/s ❌

  • 3 replicas: 9.7 req/s ❌

So throughput grows sub-linearly when I add replicas, which is far from what I expected.


Repro code (simplified)

I’m sending 200 async requests to the deployment:

import asyncio

import numpy as np
from ray import serve

number_of_tasks = 200  # total number of requests sent

handle = serve.get_deployment_handle(
    "MultimodalEmbeddingService", app_name="ai_multimodal_embedding_service"
)

semaphore = asyncio.Semaphore(50)  # cap on concurrent in-flight requests


async def process_task(i):
    async with semaphore:
        response = await handle.run.remote(
            {
                "data": [{"image": np.random.rand(640, 640, 3), "id": f"test_image_{i}"}],
                "params": {"ai_model_name": "google/siglip-large-patch16-384"},
            }
        )
        return i, response


async def main():
    tasks = [process_task(i) for i in range(number_of_tasks)]
    return await asyncio.gather(*tasks)


results = asyncio.run(main())

Deployment uses:

@serve.deployment(ray_actor_options={"num_gpus": 1})
class MultimodalEmbeddingService:
    ...
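For context, here is a fuller sketch of the deployment's shape (model loading and inference details omitted; I vary the replica count between runs, shown here via num_replicas just for illustration):

from ray import serve


@serve.deployment(
    ray_actor_options={"num_gpus": 1},  # one GPU per replica
    num_replicas=3,                     # changed to 1 / 2 / 3 between test runs
)
class MultimodalEmbeddingService:
    def __init__(self):
        # Load the SigLIP model onto this replica's GPU (omitted).
        ...

    async def run(self, request: dict):
        # Embed the images in request["data"] and return the results (omitted).
        ...


app = MultimodalEmbeddingService.bind()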

What I’ve checked so far

  • Each replica has its own GPU (Ray autoscaler allocated 3 GPUs / 14 CPUs) and is fully running when I launch my script

  • Tried increasing/decreasing semaphore size (concurrency)

  • Logs don’t show crashes

  • Model returns valid results

  • This is running in a multi-node EC2 cluster

  • Dashboard shows all replicas are healthy and GPU utilization seems normal


Questions

  1. What could cause sub-linear throughput when scaling the number of replicas in Ray Serve?

  2. Could it be due to scheduling or data transfer bottlenecks?

Let me know if you’d like full logs or more code.
Thanks in advance!

Sub-linear throughput in Ray Serve when increasing replicas has come up before, and it can be caused by several things: Ray Serve’s request routing (power-of-two-choices), backoff mechanisms that avoid overloading replicas, and bottlenecks in data serialization/deserialization or network transfer, especially with large payloads.

See the following discussions/docs: Ray Serve Performance Tuning, GitHub Issue #52609, GitHub Issue #52745.

Kind of! If your requests carry very large payloads, serialization plus network transfer between the client, the proxy, and the replicas can become the bottleneck. That overhead grows as requests are spread across more replicas, especially when they sit on different nodes, and it can cap throughput.

Do you know how big your payloads are?
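If you’re not sure, something like this gives a rough estimate of one request’s size (pickle is only a stand-in for the serialization Ray actually uses, so treat the numbers as ballpark):

import pickle

import numpy as np

# One request payload, built the same way as in the repro script.
payload = {
    "data": [{"image": np.random.rand(640, 640, 3), "id": "test_image_0"}],
    "params": {"ai_model_name": "google/siglip-large-patch16-384"},
}

print(f"raw image array: {payload['data'][0]['image'].nbytes / 1e6:.1f} MB")  # float64 pixels
print(f"pickled payload: {len(pickle.dumps(payload)) / 1e6:.1f} MB")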


Each payload was approximately 1 MB. I tested passing an S3 URL directly, with the replica downloading the data locally, and observed improved performance. Throughput also appears to scale almost linearly with the number of replicas.
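Roughly what the replica-side change looks like (a sketch; the bucket layout, boto3 usage, and the load_image helper name are just my illustration):

import io

import boto3
from PIL import Image


def load_image(s3_url: str) -> Image.Image:
    """Download an image referenced by an s3:// URL and decode it on the replica."""
    bucket, key = s3_url.removeprefix("s3://").split("/", 1)
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
    return Image.open(io.BytesIO(body)).convert("RGB")


# The client now sends {"image_url": "s3://...", "id": ...} instead of the raw
# numpy array, so only a short string crosses the network per request.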

I have a follow-up question: if my model can only process one request at a time (GPU fully utilized), would a configuration of target_ongoing_requests: 1 and max_ongoing_requests: 10 be appropriate? I tried setting max_ongoing_requests to 1, but it actually degraded performance.
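For clarity, this is the kind of configuration I mean (sketch only; my understanding is that target_ongoing_requests lives under autoscaling_config while max_ongoing_requests is a deployment-level option):

from ray import serve


@serve.deployment(
    ray_actor_options={"num_gpus": 1},
    max_ongoing_requests=10,           # requests a single replica will accept at once
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 3,
        "target_ongoing_requests": 1,  # autoscaler aims for ~1 in-flight request per replica
    },
)
class MultimodalEmbeddingService:
    ...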

Thank you.