1. Severity of the issue: (select one)
   - [ ] None: I’m just curious or want clarification.
   - [ ] Low: Annoying but doesn’t hinder my work.
   - [ ] Medium: Significantly affects my productivity but can find a workaround.
   - [x] High: Completely blocks me.
2. Environment:
   - Ray version: 2.44.0
   - Python version: 3.12.9
   - OS: Ubuntu 22.04
   - Cloud/Infrastructure:
   - Other libs/tools (if relevant):
3. What happened vs. what you expected:
   - Expected: the same performance from standalone vLLM and Ray + vLLM.
   - Actual: Ray + vLLM shows 2–3x higher latency (TTFT) than standalone vLLM.

I used the sample code from the docs page "Serving LLMs — Ray 2.44.1" to run my model. I only added the argparse handling and updated some values in `deployment_config`:
```python
from ray import serve
from ray.serve.llm import LLMConfig, LLMServer, LLMRouter
from ray.serve.schema import LoggingConfig
import os
import argparse
import time
import logging
import uvicorn


def configure_logging():
    logging.basicConfig(level=logging.ERROR)
    logger = logging.getLogger("ray")
    logger.setLevel(logging.WARNING)  # Modify the Ray logging config

    # Configure Uvicorn logger
    uvicorn_logger = logging.getLogger("uvicorn")
    uvicorn_logger.setLevel(logging.WARNING)

    # Optionally configure access logger if you want to modify that too
    uvicorn_access_logger = logging.getLogger("uvicorn.access")
    uvicorn_access_logger.setLevel(logging.WARNING)


def main():
    configure_logging()

    parser = argparse.ArgumentParser(description="Deploy and test LLM using Ray Serve")
    parser.add_argument("--accelerator_type", type=str, help="MI300X or H100", required=True)
    parser.add_argument("--model_path", type=str, help="Path to the model folder", required=True)
    parser.add_argument("--tp", type=int, help="Tensor parallel size", required=True)
    parser.add_argument("--quant_type", type=str, default=None, help="Quantization type")
    parser.add_argument("--kv_type", type=str, default="auto", help="KV cache data type")
    parser.add_argument("--max_model_len", type=int, help="Maximum model length", required=True)
    parser.add_argument("--max_num_batched_tokens", type=int, help="Maximum number of batched tokens", required=True)
    parser.add_argument("--concurrency", type=int, help="Concurrency level for testing", required=True)
    parser.add_argument("--sche", type=int, help="Number of scheduler steps", required=True)
    args = parser.parse_args()

    # Create LLMConfig object
    n_replica = 1
    llm_config = LLMConfig(
        model_loading_config=dict(
            model_id=args.model_path,
            model_source=args.model_path,
        ),
        deployment_config=dict(
            max_ongoing_requests=256,  # or use args.concurrency
            autoscaling_config=dict(
                initial_replicas=n_replica,
                min_replicas=n_replica,
                max_replicas=n_replica,
            ),
            ray_actor_options=dict(
                num_cpus=90,
            ),
        ),
        accelerator_type=args.accelerator_type,
        engine_kwargs=dict(
            swap_space=16,
            tensor_parallel_size=args.tp,
            num_scheduler_steps=args.sche,
            dtype="float16",
            gpu_memory_utilization=0.8,
            enable_chunked_prefill=False,
            enable_prefix_caching=False,
            max_model_len=args.max_model_len,
            max_num_batched_tokens=args.max_num_batched_tokens,
            quantization=None if args.quant_type == "None" else args.quant_type,
            kv_cache_dtype=args.kv_type,
            max_num_seqs=512,
            max_seq_len_to_capture=4096,
            disable_log_requests=True,
        ),
    )
    print(f"[DEBUG] llm_config={llm_config}")

    # Deploy the LLMServer
    deployment = LLMServer.as_deployment(
        llm_config.get_serve_options(name_prefix="vLLM:")
    ).bind(llm_config)
    llm_app = LLMRouter.as_deployment().bind([deployment])

    # Run the serve deployment
    logging_config = LoggingConfig(log_level="WARNING")
    serve.run(llm_app, logging_config=logging_config)

    # Keep the driver process alive so the deployment stays up
    while True:
        time.sleep(100)


if __name__ == "__main__":
    main()
```
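
For reference, this is roughly how I confirm the endpoint is serving before benchmarking. It is only a minimal sketch: it assumes the default Serve HTTP port 8000, the OpenAI-compatible `/v1` routes exposed by the router, and the model path as the model id (matching the benchmark command below); the prompt and the `api_key` value are placeholders.

```python
# Minimal smoke test of the deployed endpoint (not part of ray_engine.py).
# Assumptions: Serve listens on localhost:8000, exposes OpenAI-compatible
# /v1 routes, and the model id equals the model path passed above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="/data/huggingface/hub/meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    max_tokens=32,
)
print(response.choices[0].message.content)
```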
My deployment and benchmark commands:
```bash
python ray_engine.py --accelerator_type H100 \
    --model_path /data/huggingface/hub/meta-llama/Llama-3.1-8B-Instruct \
    --tp 1 --quant_type None --kv_type auto --max_model_len 512 \
    --max_num_batched_tokens 8192 --concurrency 64 --sche 4

# Run vLLM benchmark:
python3 /Path2vLLM/vllm/benchmarks/benchmark_serving.py --host localhost --backend openai --port 8000 \
    --model /data/huggingface/hub/meta-llama/Llama-3.1-8B-Instruct --dataset-name random \
    --num-prompts 3000 --random-input-len 128 --random-output-len 128 \
    --max-concurrency 256 \
    --percentile-metrics ttft,tpot,itl,e2el
```
Note: you might notice that the `benchmark_serving.py` script shows “Total generated tokens: 0” and “Mean TPOT (ms): 0.00”. The benchmark script itself is not faulty; it just does not correctly parse the response from Ray. The TTFT metric, however, is reported correctly.
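
To rule out that parsing issue as the cause of the gap, I also measure time-to-first-token directly with a single streaming request. This is a rough sketch; the port, the `/v1/completions` route, the model id, and the prompt are assumptions taken from the benchmark command above.

```python
# Rough TTFT check with one streaming request (independent of benchmark_serving.py).
# Assumptions: the server is on localhost:8000 and exposes /v1/completions;
# the model id is the model path used in the benchmark command.
import time

import requests

URL = "http://localhost:8000/v1/completions"
MODEL = "/data/huggingface/hub/meta-llama/Llama-3.1-8B-Instruct"

payload = {
    "model": MODEL,
    "prompt": "Hello, how are you?",
    "max_tokens": 128,
    "stream": True,
}

start = time.perf_counter()
ttft = None
with requests.post(URL, json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        # Server-sent events: lines look like "data: {...}" or "data: [DONE]".
        if line and line.startswith(b"data:") and b"[DONE]" not in line:
            ttft = time.perf_counter() - start  # first streamed chunk received
            break

print(f"TTFT: {ttft * 1000:.1f} ms" if ttft is not None else "No tokens received")
```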
Here are the results (input length and output length are both 128):

| Model Instances / Replicas | `--max-concurrency` | Backend | Mean TTFT (ms) | TTFT ratio (Ray / vLLM) |
|---|---|---|---|---|
| 1 | 1 | vLLM | 19 | 3.74 |
| 1 | 1 | Ray + vLLM | 71 | |
| 1 | 2 | vLLM | 22 | 3.23 |
| 1 | 2 | Ray + vLLM | 71 | |
| 1 | 4 | vLLM | 29 | 2.69 |
| 1 | 4 | Ray + vLLM | 78 | |
| 1 | 8 | vLLM | 43 | 1.81 |
| 1 | 8 | Ray + vLLM | 78 | |
| 1 | 16 | vLLM | 57 | 2.09 |
| 1 | 16 | Ray + vLLM | 119 | |
| 1 | 32 | vLLM | 63 | 3.16 |
| 1 | 32 | Ray + vLLM | 199 | |
For example, at `--max-concurrency` 1, Ray + vLLM has a mean TTFT of 71 ms, while standalone vLLM has only 19 ms.

How can I tune the Ray Serve arguments to make it as fast as standalone vLLM?