Severity: Medium – Significantly affects my productivity but I can find a workaround.
Environment:

- Ray version: 2.48.0
- Python version: 3.12.11
- OS: Ubuntu 22, no Docker
- Infra: Ray autoscaler with AWS EC2
What I expected
I’m serving a SigLIP model (google/siglip-large-patch16-384) using Ray Serve. I expected that increasing the number of replicas would increase throughput linearly (or close to it) under load.
Example:

- 1 replica → expected ~5 req/s
- 2 replicas → expected ~10 req/s
- 3 replicas → expected ~15 req/s
What actually happened
Here’s the measured throughput:

- 1 replica: 4.9 req/s
- 2 replicas: 6.8 req/s
- 3 replicas: 9.7 req/s
So throughput grows sub-linearly as I add replicas: three replicas give roughly 2× the single-replica throughput rather than the ~3× I expected.
Repro code (simplified)
I’m sending 200 async requests to the deployment:

```python
import asyncio

import numpy as np
from ray import serve

handle = serve.get_deployment_handle(
    "MultimodalEmbeddingService", app_name="ai_multimodal_embedding_service"
)
semaphore = asyncio.Semaphore(50)  # number of concurrent requests

async def process_task(i):
    async with semaphore:
        response = await handle.run.remote(
            {
                "data": [{"image": np.random.rand(640, 640, 3), "id": f"test_image_{i}"}],
                "params": {"ai_model_name": "google/siglip-large-patch16-384"},
            }
        )
    return i, response

number_of_tasks = 200
tasks = [process_task(i) for i in range(number_of_tasks)]
results = await asyncio.gather(*tasks)
```
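The req/s figures above come from timing the whole batch; a minimal sketch of the measurement (the `tasks` list is the one built in the snippet above):

```python
import asyncio
import time

async def run_and_measure(tasks):
    # Time the full batch and derive an aggregate requests-per-second figure.
    start = time.perf_counter()
    results = await asyncio.gather(*tasks)
    elapsed = time.perf_counter() - start
    print(f"{len(results)} requests in {elapsed:.2f}s -> {len(results) / elapsed:.1f} req/s")
    return results
```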
Deployment uses:

```python
@serve.deployment(ray_actor_options={"num_gpus": 1})
class MultimodalEmbeddingService:
    ...
```
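For context, here is roughly how the replica count is varied between runs (model-loading details omitted and the method body simplified; `num_replicas` is the only option changed between the 1/2/3-replica runs):

```python
from ray import serve

@serve.deployment(
    num_replicas=3,                     # set to 1, 2, or 3 for the runs above
    ray_actor_options={"num_gpus": 1},  # one dedicated GPU per replica
)
class MultimodalEmbeddingService:
    async def run(self, payload: dict) -> dict:
        # Model inference omitted in this sketch.
        ...
```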
What I’ve checked so far

- Each replica has its own GPU (the Ray autoscaler allocated 3 GPUs / 14 CPUs) and is fully running when I launch my script
- Tried increasing/decreasing the semaphore size (concurrency)
- Logs don’t show crashes
- The model returns valid results
- This is running in a multi-node EC2 cluster
- The dashboard shows all replicas as healthy, and GPU utilization seems normal
Questions

- What could cause sub-linear throughput when scaling the number of replicas in Ray Serve?
- Could it be due to scheduling or data-transfer bottlenecks?
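On the data-transfer question: one thing worth noting is that each request in the repro ships a raw float64 array (`np.random.rand`'s default dtype), so in a multi-node cluster a non-trivial payload crosses the network per call. Back-of-the-envelope:

```python
# Size of the image payload each request carries in the repro above:
# a 640x640x3 array of float64 (8 bytes per element).
height, width, channels, itemsize = 640, 640, 3, 8
payload_bytes = height * width * channels * itemsize
print(f"~{payload_bytes / 2**20:.1f} MiB per request")  # ~9.4 MiB
```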
Let me know if you’d like full logs or more code.
Thanks in advance!