Ray Serve LLM APIs have 2~3x higher latency

1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
[V] High: Completely blocks me.

2. Environment:

  • Ray version: 2.44.0
  • Python version: 3.12.9
  • OS: Ubuntu 22.04
  • Cloud/Infrastructure:
  • Other libs/tools (if relevant):

3. What happened vs. what you expected:

  • Expected: Same performance between standalone vLLM and Ray+vLLM
  • Actual: Ray+vLLM shows 2~3x higher latency than standalone vLLM

I used your sample code (Serving LLMs — Ray 2.44.1) to run my model. I only added the argparse handling and updated some values in deployment_config.

from ray import serve
from ray.serve.llm import LLMConfig, LLMServer, LLMRouter
from ray.serve.schema import LoggingConfig
import argparse
import time
import logging

def configure_logging():
    logging.basicConfig(level=logging.ERROR)
    
    logger = logging.getLogger("ray")
    logger.setLevel(logging.WARNING) # Modify the Ray logging config

    # Configure Uvicorn logger
    uvicorn_logger = logging.getLogger("uvicorn")
    uvicorn_logger.setLevel(logging.WARNING)
    
    # Optionally configure access logger if you want to modify that too
    uvicorn_access_logger = logging.getLogger("uvicorn.access")
    uvicorn_access_logger.setLevel(logging.WARNING)


def main():
    configure_logging()

    parser = argparse.ArgumentParser(description="Deploy and test LLM using Ray Serve")
    parser.add_argument("--accelerator_type", type=str, help="MI300X or H100", required=True)
    parser.add_argument("--model_path", type=str, help="Path to the model folder", required=True)
    parser.add_argument("--tp", type=int, help="Tensor parallel size", required=True)
    parser.add_argument("--quant_type", type=str, default=None, help="Quantization type")
    parser.add_argument("--kv_type", type=str, default="auto", help="KV cache data type")
    parser.add_argument("--max_model_len", type=int, help="Maximum model length", required=True)
    parser.add_argument("--max_num_batched_tokens", type=int, help="Maximum number of batched tokens", required=True)
    parser.add_argument("--concurrency", type=int, help="Concurrency level for testing", required=True)
    parser.add_argument("--sche", type=int, help="Number of scheduler steps", required=True)
    args = parser.parse_args()
    # Create LLMConfig object

    n_replica=1
    llm_config = LLMConfig(
        model_loading_config=dict(
            model_id=args.model_path,
            model_source=args.model_path,
        ),
        deployment_config=dict(
            max_ongoing_requests=256, # or use args.concurrency
            autoscaling_config=dict(
                initial_replicas=n_replica,
                min_replicas=n_replica,
                max_replicas=n_replica,
            ),
            ray_actor_options=dict(
                num_cpus=90
            ),
        ),
        accelerator_type=args.accelerator_type,
        engine_kwargs=dict(
            swap_space=16,
            tensor_parallel_size=args.tp,
            num_scheduler_steps=args.sche,
            dtype="float16",
            gpu_memory_utilization=0.8,
            enable_chunked_prefill=False,
            enable_prefix_caching=False,
            max_model_len=args.max_model_len,
            max_num_batched_tokens=args.max_num_batched_tokens,
            quantization=None if args.quant_type == 'None' else args.quant_type,
            kv_cache_dtype=args.kv_type,
            max_num_seqs=512,
            max_seq_len_to_capture=4096,
            disable_log_requests=True,
        ),
    )

    print(f"[DEBUG] llm_config={llm_config}")
    # Deploy the LLMServer
    deployment = LLMServer.as_deployment(llm_config.get_serve_options(name_prefix="vLLM:")).bind(llm_config)
    llm_app = LLMRouter.as_deployment().bind([deployment])

    # Run the serve deployment
    logging_config = LoggingConfig(log_level="WARNING")
    serve.run(llm_app, logging_config=logging_config)
    
    # Keep the driver process alive so the Serve application stays up.
    while True:
        time.sleep(100)

if __name__ == "__main__":
    main()

My benchmark commands:

python ray_engine.py --accelerator_type H100 \
	--model_path /data/huggingface/hub/meta-llama/Llama-3.1-8B-Instruct \
	--tp 1 --quant_type None --kv_type auto --max_model_len 512 \
	--max_num_batched_tokens 8192 --concurrency 64 --sche 4 

# Run vLLM benchmark:
python3 /Path2vLLM/vllm/benchmarks/benchmark_serving.py --host localhost --backend openai --port 8000 \
    --model /data/huggingface/hub/meta-llama/Llama-3.1-8B-Instruct --dataset-name random \
    --num-prompts 3000 --random-input-len 128 --random-output-len 128 \
    --max-concurrency 256 \
    --percentile-metrics ttft,tpot,itl,e2el 

Note: You might notice that the benchmark_serving.py script shows “Total generated tokens: 0” and “Mean TPOT (ms): 0.00”. The benchmark script itself is not faulty; it just doesn’t parse the responses from Ray correctly. The TTFT metric, however, is reported correctly.
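
To see what the benchmark script actually receives, a raw streaming request can be sent directly to the Serve endpoint. This is only a sketch: it assumes the default Ray Serve HTTP port (8000), the OpenAI-compatible routes exposed by LLMRouter, and that the model id equals the --model_path used above.

from openai import OpenAI

# Sketch for inspecting raw streaming chunks (assumes localhost:8000 and a
# model id equal to the path passed via --model_path).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="/data/huggingface/hub/meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=32,
    stream=True,
)
for chunk in stream:
    # Print each chunk as-is to check how Ray formats the streamed deltas.
    print(chunk)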

Here are the results (input and output lengths are both 128):

| Instance/Replica | --max-concurrency | vLLM / Ray + vLLM | Mean TTFT (ms) | Difference (Ray / vLLM) |
|---|---|---|---|---|
| 1 | 1 | vLLM | 19 | 3.74 |
| 1 | 1 | Ray + vLLM | 71 | |
| 1 | 2 | vLLM | 22 | 3.23 |
| 1 | 2 | Ray + vLLM | 71 | |
| 1 | 4 | vLLM | 29 | 2.69 |
| 1 | 4 | Ray + vLLM | 78 | |
| 1 | 8 | vLLM | 43 | 1.81 |
| 1 | 8 | Ray + vLLM | 78 | |
| 1 | 16 | vLLM | 57 | 2.09 |
| 1 | 16 | Ray + vLLM | 119 | |
| 1 | 32 | vLLM | 63 | 3.16 |
| 1 | 32 | Ray + vLLM | 199 | |

For example, at max-concurrency=1 the table shows a mean TTFT of 71 ms for Ray+vLLM versus only 19 ms for standalone vLLM.

How can I tune Ray’s arguments to make it as fast as standalone vLLM?


Hi @Jacob_Chang, thanks for reporting this! We are actively looking into the issue and will report back when we know more about why this gap exists.

@Gene Thank you!
Is this a known issue on your end?
I had a colleague use Nsight Systems to profile standalone vLLM (left) and Ray+vLLM (right).
Standalone vLLM spends 80.9% of its runtime in GPU kernels, while Ray+vLLM spends only 20.3%. The lower kernel utilization in Ray+vLLM could be the reason for the higher TTFT and reduced throughput.

Hi @Jacob_Chang, we posted our recent findings as an issue on GitHub. Basically, this overhead only shows up with stream=True at high concurrency; with stream=False or at low concurrency the overhead should be very small.

This should mitigate the overhead issue to some extent at high concurrencies.

So basically, with this PR you can tune "stream_batching_interval_ms" to reduce the end-to-end latency at high concurrency levels. In my benchmarks I was able to eliminate the overhead entirely up to concurrency=128; at 256 I could cut it roughly in half, though some still remains.
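
For reference, here is a rough sketch of where that knob could be set in the script above. I am assuming it is exposed through LLMConfig's experimental_configs, as in the PR; please double-check the exact field name and placement against the PR and the docs for your Ray version.

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="/data/huggingface/hub/meta-llama/Llama-3.1-8B-Instruct",
        model_source="/data/huggingface/hub/meta-llama/Llama-3.1-8B-Instruct",
    ),
    # Assumed placement: batch streamed tokens for N ms before sending them
    # back through the router; larger values trade a little per-chunk latency
    # for less per-token overhead at high concurrency.
    experimental_configs=dict(
        stream_batching_interval_ms=50,
    ),
    # ... deployment_config, accelerator_type, engine_kwargs as in the original script ...
)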

The issue should fundamentally be addressed at the Ray Core level, which will be prioritized among other work this quarter or next.

@kourosh Good to hear from you. Thanks for the approach. Let me try it later!