Ray multiplexing for higher concurrency

Hi Ray Team,

I have deployed a Ray Service with model multiplexing.

Avg per-model latency: 30 ms
Avg per-model load time: 2 s

The system is stable up to about 200 concurrent requests, although latency slowly accumulates; beyond 200 concurrent requests, latency keeps increasing.
Neither increasing the number of pods or model replicas, nor adjusting max_ongoing_requests or max_queued_requests, is helping much.

Adding more GPU pods / model replicas isn’t helping either; it only leaves the system underutilized.

Is there any document/procedure for scaling Ray with model multiplexing?

Ray Serve model multiplexing can hit scaling bottlenecks due to how requests are routed and models are loaded/unloaded on replicas. If increasing the number of replicas or GPUs does not improve throughput and the system remains underutilized, the likely causes are: (1) requests for the same model are not being grouped, leading to frequent model swaps and high load/unload overhead, and (2) the autoscaler is not “model-aware,” so it may not scale replicas in a way that keeps hot models loaded and ready for traffic. This is a known limitation, and there is currently no official document that guarantees linear scaling for multiplexed deployments under high concurrency or non-sparse workloads. See the discussion in Ray GitHub Issue #50695 and Ray GitHub Issue #48741.
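For context, the router can only route requests for the same model to the same replica if each request carries the multiplexed model ID, either via the `serve_multiplexed_model_id` HTTP header or via `handle.options(multiplexed_model_id=...)` on a deployment handle. A minimal sketch (the URL, app name, model ID, and payload are placeholders):

```python
import requests
from ray import serve

# HTTP path: the router uses this header to prefer replicas that already
# have "model_a" loaded, avoiding an extra ~2 s load on a cold replica.
resp = requests.post(
    "http://localhost:8000/",  # placeholder URL for your Serve endpoint
    headers={"serve_multiplexed_model_id": "model_a"},
    json={"input": [1.0, 2.0, 3.0]},
)

# DeploymentHandle path (e.g. from a driver script or another deployment):
handle = serve.get_app_handle("default")  # placeholder app name
result = handle.options(multiplexed_model_id="model_a").remote(
    {"input": [1.0, 2.0, 3.0]}
).result()
```

If the ID is missing or effectively random per request, every replica ends up churning through loads and unloads, which matches the latency buildup you describe.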

To improve scaling, consider: (a) reducing model load/unload frequency by batching or grouping requests for the same model (this is not yet natively supported, but is a requested feature), (b) tuning max_num_models_per_replica to allow more models to be cached per replica, and (c) using custom autoscaling policies that are “model-aware” if possible. Also, monitor the built-in multiplexing metrics (e.g., model load/unload counters, queue sizes) to identify bottlenecks (Ray Serve Monitoring). A minimal sketch of point (b) is included below; if you need more detail, let me know.
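Here is a minimal deployment-side sketch for point (b), assuming a hypothetical `load_model_from_storage` loader and a `predict` method on the loaded model; the `max_num_models_per_replica` value is illustrative and should be sized against available GPU memory:

```python
from ray import serve
from starlette.requests import Request


@serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 1})
class MultiplexedModel:
    # Cache up to 8 models per replica (the default is 3). A larger cache
    # reduces load/unload churn when many models share a replica, at the
    # cost of extra GPU memory.
    @serve.multiplexed(max_num_models_per_replica=8)
    async def get_model(self, model_id: str):
        # Hypothetical loader -- replace with your actual ~2 s model load.
        return await load_model_from_storage(model_id)

    async def __call__(self, request: Request):
        # Serve populates this from the serve_multiplexed_model_id header.
        model_id = serve.get_multiplexed_model_id()
        model = await self.get_model(model_id)
        return model.predict(await request.json())


app = MultiplexedModel.bind()
```

With a larger per-replica cache, hot models stay resident and the ~2 s load cost is paid less often; if this is the bottleneck, the model load/unload counters in the Serve multiplexing metrics should drop noticeably.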

Hint: Mention @RunLLM in the post for followups.