Hi everyone,
I want to use Ray Serve LLM in the following way to host a model from Hugging Face.
In particular, I’m looking into this model:
As you can see in their Hugging Face code, the preprocessing one has to do before sending the data to the model is custom:
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)

mm_data = {}
if image_inputs is not None:
    mm_data["image"] = image_inputs
if video_inputs is not None:
    mm_data["video"] = video_inputs

llm_inputs = {
    "prompt": prompt,
    "multi_modal_data": mm_data,
    # FPS will be returned in video_kwargs
    "mm_processor_kwargs": video_kwargs,
}

outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
Now, when I just start an llm-serve deployment with Ray, I don't have any control over how the preprocessor is called. Most importantly, I don't seem to have access to the explicit llm.generate call (where llm is a vLLM LLM); instead, everything is hidden behind OpenAI-like APIs that expose chat, completions, … functions.
What do I have to do to basically overload all of these OpenAI-API functions and, at their core, just call the model's generate function, so that I can do all the preprocessing myself beforehand?
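To spell out what I mean by "at its core call generate": per request, all I really want to run is something like this (just a sketch of my intent; llm_inputs comes from the custom preprocessing above, and llm would be an AsyncLLMEngine that I construct and own myself):

# Sketch of the per-request core I want to keep control over.
# llm is an AsyncLLMEngine (so generate() returns an async generator);
# llm_inputs and sampling_params come from my own preprocessing code.
final_output = None
async for request_output in llm.generate(
    llm_inputs, sampling_params=sampling_params, request_id=request_id
):
    final_output = request_output
generated_text = final_output.outputs[0].text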
More explicitly, without ray I have code like this:
from vllm import AsyncLLMEngine, AsyncEngineArgs
...
self.llm = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(
        model=model_path,
        limit_mm_per_prompt={'image': 10, 'video': 10},
        tensor_parallel_size=tensor_parallel_size,
        gpu_memory_utilization=0.45,
        max_model_len=4096,
    )
)
...
self.llm.generate(llm_inputs, sampling_params=self.sampling_params, request_id=request_id)
My understanding is that Ray Serve "just" wraps an actor around a vLLM engine.
So how do I get rid of all the surrounding complexity and use Ray to call vLLM directly?
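Concretely, what I'm picturing is something like the sketch below: a plain Ray Serve deployment that owns the AsyncLLMEngine and exposes one custom endpoint, with no OpenAI layer in between. This is only what I imagine, not working code; the names (CustomVLMDeployment, the /generate route) are made up, and I'm assuming process_vision_info comes from qwen_vl_utils as in the model's example code. I don't know whether this is the intended way to do it or whether I'm supposed to hook into the existing Ray Serve LLM deployment instead:

# Sketch only: a Ray Serve deployment that owns the vLLM engine directly.
# CustomVLMDeployment and the /generate route are hypothetical names;
# single-GPU setup assumed (no tensor parallelism handled here).
import uuid

from fastapi import FastAPI
from qwen_vl_utils import process_vision_info
from ray import serve
from transformers import AutoProcessor
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

app = FastAPI()


@serve.deployment(ray_actor_options={"num_gpus": 1})
@serve.ingress(app)
class CustomVLMDeployment:
    def __init__(self, model_path: str):
        # The replica owns the vLLM engine itself; no OpenAI-style layer on top.
        self.llm = AsyncLLMEngine.from_engine_args(
            AsyncEngineArgs(
                model=model_path,
                limit_mm_per_prompt={"image": 10, "video": 10},
                gpu_memory_utilization=0.45,
                max_model_len=4096,
            )
        )
        self.processor = AutoProcessor.from_pretrained(model_path)
        self.sampling_params = SamplingParams(max_tokens=512)

    @app.post("/generate")
    async def generate(self, request: dict):
        messages = request["messages"]
        request_id = request.get("request_id", str(uuid.uuid4()))

        # Custom preprocessing, exactly as in the Hugging Face example.
        prompt = self.processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        image_inputs, video_inputs, video_kwargs = process_vision_info(
            messages, return_video_kwargs=True
        )
        mm_data = {}
        if image_inputs is not None:
            mm_data["image"] = image_inputs
        if video_inputs is not None:
            mm_data["video"] = video_inputs
        llm_inputs = {
            "prompt": prompt,
            "multi_modal_data": mm_data,
            "mm_processor_kwargs": video_kwargs,
        }

        # Direct call to the engine's generate(), consuming the async generator.
        final_output = None
        async for request_output in self.llm.generate(
            llm_inputs, sampling_params=self.sampling_params, request_id=request_id
        ):
            final_output = request_output
        return {"text": final_output.outputs[0].text}

If that is roughly the right direction, I would then expect serve.run(CustomVLMDeployment.bind(model_path=...)) plus a plain HTTP POST to /generate to replace the OpenAI-style routes entirely, but I'd appreciate confirmation that this is the intended approach.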
Thanks in advance!
M