Ray Serve LLM APIs have 2~3x higher latency

1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
[V] High: Completely blocks me.

2. Environment:

  • Ray version: 2.44.0
  • Python version: 3.12.9
  • OS: Ubuntu 22.04
  • Cloud/Infrastructure:
  • Other libs/tools (if relevant):

3. What happened vs. what you expected:

  • Expected: Same performance between standalone vLLM and Ray+vLLM
  • Actual: Ray+vLLM shows 2~3x higher latency than standalone vLLM

I used your sample code (Serving LLMs — Ray 2.44.1) to run my model. I only added the argparse handling and updated some values in deployment_config.

from ray import serve
from ray.serve.llm import LLMConfig, LLMServer, LLMRouter
from ray.serve.schema import LoggingConfig
import argparse
import time
import logging

def configure_logging():
    logging.basicConfig(level=logging.ERROR)
    
    logger = logging.getLogger("ray")
    logger.setLevel(logging.WARNING) # Modify the Ray logging config

    # Configure Uvicorn logger
    uvicorn_logger = logging.getLogger("uvicorn")
    uvicorn_logger.setLevel(logging.WARNING)
    
    # Optionally configure access logger if you want to modify that too
    uvicorn_access_logger = logging.getLogger("uvicorn.access")
    uvicorn_access_logger.setLevel(logging.WARNING)


def main():
    configure_logging()

    parser = argparse.ArgumentParser(description="Deploy and test LLM using Ray Serve")
    parser.add_argument("--accelerator_type", type=str, help="MI300X or H100", required=True)
    parser.add_argument("--model_path", type=str, help="Path to the model folder", required=True)
    parser.add_argument("--tp", type=int, help="Tensor parallel size", required=True)
    parser.add_argument("--quant_type", type=str, default=None, help="Quantization type")
    parser.add_argument("--kv_type", type=str, default="auto", help="KV cache data type")
    parser.add_argument("--max_model_len", type=int, help="Maximum model length", required=True)
    parser.add_argument("--max_num_batched_tokens", type=int, help="Maximum number of batched tokens", required=True)
    parser.add_argument("--concurrency", type=int, help="Concurrency level for testing", required=True)
    parser.add_argument("--sche", type=int, help="Number of scheduler steps", required=True)
    args = parser.parse_args()
    # Create LLMConfig object

    n_replica=1
    llm_config = LLMConfig(
        model_loading_config=dict(
            model_id=args.model_path,
            model_source=args.model_path,
        ),
        deployment_config=dict(
            max_ongoing_requests=256, # or use args.concurrency
            autoscaling_config=dict(
                initial_replicas=n_replica,
                min_replicas=n_replica,
                max_replicas=n_replica,
            ),
            ray_actor_options=dict(
                num_cpus=90
            ),
        ),
        accelerator_type=args.accelerator_type,
        engine_kwargs=dict(
            swap_space=16,
            tensor_parallel_size=args.tp,
            num_scheduler_steps=args.sche,
            dtype="float16",
            gpu_memory_utilization=0.8,
            enable_chunked_prefill=False,
            enable_prefix_caching=False,
            max_model_len=args.max_model_len,
            max_num_batched_tokens=args.max_num_batched_tokens,
            quantization=None if args.quant_type == 'None' else args.quant_type,
            kv_cache_dtype=args.kv_type,
            max_num_seqs=512,
            max_seq_len_to_capture=4096,
            disable_log_requests=True,
        ),
    )

    print(f"[DEBUG] llm_config={llm_config}")
    # Deploy the LLMServer
    deployment = LLMServer.as_deployment(llm_config.get_serve_options(name_prefix="vLLM:")).bind(llm_config)
    llm_app = LLMRouter.as_deployment().bind([deployment])

    # Run the serve deployment
    logging_config = LoggingConfig(log_level="WARNING")
    serve.run(llm_app, logging_config=logging_config)
    
    # Keep the driver process alive so the Serve application stays up.
    while True:
        time.sleep(100)

if __name__ == "__main__":
    main()

My benchmark commands:

python ray_engine.py --accelerator_type H100 \
	--model_path /data/huggingface/hub/meta-llama/Llama-3.1-8B-Instruct \
	--tp 1 --quant_type None --kv_type auto --max_model_len 512 \
	--max_num_batched_tokens 8192 --concurrency 64 --sche 4 

# Run vLLM benchmark:
python3 /Path2vLLM/vllm/benchmarks/benchmark_serving.py --host localhost --backend openai --port 8000 \
    --model /data/huggingface/hub/meta-llama/Llama-3.1-8B-Instruct --dataset-name random \
    --num-prompts 3000 --random-input-len 128 --random-output-len 128 \
    --max-concurrency 256 \
    --percentile-metrics ttft,tpot,itl,e2el 

Note: You might notice that the benchmark_serving.py script shows “Total generated tokens: 0” and “Mean TPOT (ms): 0.00”. The benchmark script itself is not faulty; it just doesn’t parse the responses from Ray correctly. The TTFT metric, however, is reported correctly.
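
To see what the benchmark script actually receives, a raw streaming request can be sent directly to the Serve endpoint. This is only a sketch: it assumes the default Ray Serve HTTP port (8000), the OpenAI-compatible routes exposed by LLMRouter, and that the model id equals the --model_path used above.

from openai import OpenAI

# Sketch for inspecting raw streaming chunks (assumes localhost:8000 and a
# model id equal to the path passed via --model_path).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="/data/huggingface/hub/meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=32,
    stream=True,
)
for chunk in stream:
    # Print each chunk as-is to check how Ray formats the streamed deltas.
    print(chunk)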

Here are the results (input and output lengths are both 128):

| Instance/Replica | --max-concurrency | vLLM / Ray + vLLM | Mean TTFT (ms) | Difference (Ray / vLLM) |
|---|---|---|---|---|
| 1 | 1 | vLLM | 19 | 3.74 |
| 1 | 1 | Ray + vLLM | 71 | |
| 1 | 2 | vLLM | 22 | 3.23 |
| 1 | 2 | Ray + vLLM | 71 | |
| 1 | 4 | vLLM | 29 | 2.69 |
| 1 | 4 | Ray + vLLM | 78 | |
| 1 | 8 | vLLM | 43 | 1.81 |
| 1 | 8 | Ray + vLLM | 78 | |
| 1 | 16 | vLLM | 57 | 2.09 |
| 1 | 16 | Ray + vLLM | 119 | |
| 1 | 32 | vLLM | 63 | 3.16 |
| 1 | 32 | Ray + vLLM | 199 | |

For example, at max-concurrency=1 the table shows a mean TTFT of 71 ms for Ray+vLLM versus only 19 ms for standalone vLLM.

How can I tune Ray’s arguments to make it as fast as standalone vLLM?


Hi @Jacob_Chang, thanks for reporting this! We are actively looking into the issue and will report back when we know more about why this gap exists.

@Gene Thank you!
Is this a known issue on your end?
I had a colleague use Nsight Systems to profile standalone vLLM (left) and Ray+vLLM (right).
Standalone vLLM spends 80.9% of its runtime in GPU kernels, while Ray+vLLM spends only 20.3%. The lower kernel utilization in Ray+vLLM could be the reason for the higher TTFT and reduced throughput.

Hi @Jacob_Chang, we posted our recent findings as an issue on GitHub. Basically, this overhead only shows up with stream=True at high concurrency; with stream=False or at low concurrency the overhead should be very small.

This should mitigate the overhead issue to some extent at high concurrencies.

So basically, with this PR you can tune "stream_batching_interval_ms" to reduce the end-to-end latency at high concurrency levels. In my benchmarks I was able to eliminate the overhead entirely up to concurrency=128; at 256 I could cut it roughly in half, though some still remains.
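
For reference, here is a rough sketch of where that knob could be set in the script above. I am assuming it is exposed through LLMConfig's experimental_configs, as in the PR; please double-check the exact field name and placement against the PR and the docs for your Ray version.

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="/data/huggingface/hub/meta-llama/Llama-3.1-8B-Instruct",
        model_source="/data/huggingface/hub/meta-llama/Llama-3.1-8B-Instruct",
    ),
    # Assumed placement: batch streamed tokens for N ms before sending them
    # back through the router; larger values trade a little per-chunk latency
    # for less per-token overhead at high concurrency.
    experimental_configs=dict(
        stream_batching_interval_ms=50,
    ),
    # ... deployment_config, accelerator_type, engine_kwargs as in the original script ...
)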

The issue should fundamentally be addressed at the Ray Core level, which will be prioritized among other work this quarter or next.

@kourosh Good to hear from you. Thanks for the approach. Let me try it later!