Parallel requests to a Ray Serve 'OpenAI Chat Completions API' built by following this guide: Serve a Large Language Model with vLLM — Ray 2.41.0
The model is Qwen2-VL, and each request contains both text and an image.
Sending one request at a time works fine, but parallel requests fail whenever 'max_ongoing_requests >= 2'. A condensed sketch of the deployment is below.
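This is roughly what the deployment looks like, reduced from the docs example for readability: the real code serves the full OpenAI chat schema via the guide's ingress, whereas this sketch accepts a simplified JSON body. The model path, prompt template, and handler are illustrative placeholders, not the exact code.

```python
# Condensed sketch of the deployment (simplified from the Ray docs' vLLM
# example). Model path and request body format are assumptions.
import base64
import io
import uuid

from fastapi import FastAPI, Request
from PIL import Image
from ray import serve
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.sampling_params import SamplingParams

app = FastAPI()


@serve.deployment(max_ongoing_requests=2)  # any value >= 2 hits the error
@serve.ingress(app)
class VLLMDeployment:
    def __init__(self):
        self.engine = AsyncLLMEngine.from_engine_args(
            AsyncEngineArgs(model="Qwen/Qwen2-VL-7B-Instruct")  # assumed path
        )

    @app.post("/v1/chat/completions")
    async def chat(self, request: Request) -> dict:
        body = await request.json()
        # Decode the base64 image sent in the (simplified) request body.
        image = Image.open(io.BytesIO(base64.b64decode(body["image_b64"])))
        # Qwen2-VL expects one <|image_pad|> placeholder per image.
        prompt = {
            "prompt": (
                "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
                f"{body['text']}<|im_end|>\n<|im_start|>assistant\n"
            ),
            "multi_modal_data": {"image": image},
        }
        final = None
        async for out in self.engine.generate(
            prompt, SamplingParams(max_tokens=256), request_id=str(uuid.uuid4())
        ):
            final = out
        return {"text": final.outputs[0].text}


serve.run(VLLMDeployment.bind())
```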
The error stack is shown below:
ERROR 2025-01-23 00:22:21,963 vl_VLLMDeployment 4gtvteb2 e1d433cc-e551-4e5e-b10e-986dea9fe1ad /v1/chat/completions llm.py:128 - Error in generate()
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.9/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 1654, in execute_model
    hidden_or_intermediate_states = model_executable(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/vllm/model_executor/models/qwen2_vl.py", line 1287, in forward
    inputs_embeds = self._merge_multimodal_embeddings(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/vllm/model_executor/models/qwen2_vl.py", line 1237, in _merge_multimodal_embeddings
    inputs_embeds[mask, :] = multimodal_embeddings
RuntimeError: shape mismatch: value tensor of shape [644, 3584] cannot be broadcast to indexing result of shape [322, 3584]
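Note that the value tensor (644 rows) is exactly twice the indexing result (322 rows), as if the image embeddings of two batched requests were being written into a single request's placeholder slots. For completeness, this is roughly how the parallel requests are sent; the openai client usage is standard, but the base URL, model name, and image URL are placeholders:

```python
# Sketch of the client side, assuming the standard openai package.
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="NOT_USED")


def ask(i: int) -> str:
    resp = client.chat.completions.create(
        model="Qwen/Qwen2-VL-7B-Instruct",  # placeholder model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": f"Request {i}: describe this image."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/cat.jpg"}},
            ],
        }],
    )
    return resp.choices[0].message.content


# One request at a time succeeds; two in flight at once raises the
# shape mismatch above.
with ThreadPoolExecutor(max_workers=2) as pool:
    print(list(pool.map(ask, range(2))))
```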