Preprocessing in Ray Serve LLM

Hi everyone,

I want to use Ray Serve LLM to host a model from Hugging Face in the following way.

In particular, I’m looking into this model:

As you can see in their Hugging Face example code, the preprocessing that has to be done before the data is explicitly handed to the model is custom:

# From the model's Hugging Face example code; prompt, messages, llm, and
# sampling_params are defined earlier in that example.
from qwen_vl_utils import process_vision_info

image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)

mm_data = {}
if image_inputs is not None:
    mm_data["image"] = image_inputs
if video_inputs is not None:
    mm_data["video"] = video_inputs

llm_inputs = {
    "prompt": prompt,
    "multi_modal_data": mm_data,

    # FPS will be returned in video_kwargs
    "mm_processor_kwargs": video_kwargs,
}

outputs = llm.generate([llm_inputs], sampling_params=sampling_params)

Now, when I just start an LLM Serve deployment with Ray, I don’t have any control over how the preprocessor is called, and most importantly, I don’t seem to have access to the explicit llm.generate call (where llm is a vLLM LLM). Instead, everything is hidden behind OpenAI-like APIs, which expose chat, complete, … functions.

What do I have to do to overload all of these OpenAI API functions so that, at their core, they just call the model’s generate function, letting me do all the preprocessing myself beforehand?

More explicitly, without Ray I have code like this:

from vllm import AsyncEngineArgs, AsyncLLMEngine
...

self.llm = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(
        model=model_path,
        limit_mm_per_prompt={'image': 10, 'video': 10},
        tensor_parallel_size=tensor_parallel_size,
        gpu_memory_utilization=0.45,
        max_model_len=4096,
    )
)
...
self.llm.generate(llm_inputs, sampling_params=self.sampling_params, request_id=request_id)

My understanding is that Ray “just” wraps an actor around a vLLM engine.
So how do I get rid of all the surrounding complexity and use Ray to call vLLM directly?

Thanks in advance!

M

You are correct: Ray Serve LLM APIs (like LLMServer and LLMRouter) expose only OpenAI-compatible endpoints (e.g., /v1/chat/completions) and do not provide a direct way to override the preprocessing or to call llm.generate with fully custom inputs such as multi_modal_data or mm_processor_kwargs. The Ray Serve LLM stack is designed to abstract away the engine details and standardize the API surface, so you cannot directly inject your own preprocessing or bypass the OpenAI API layer in a standard deployment. This limitation is documented and has been raised as a challenge for advanced multimodal models that require custom input handling, such as Cosmos-Reason1-7B or Qwen2-VL, where the Hugging Face example code expects you to build the prompt and multimodal data yourself before calling llm.generate (see Hugging Face model card).
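
For reference, the standard deployment path looks roughly like the sketch below. It assumes the ray.serve.llm API shipped in recent Ray releases (LLMConfig, build_openai_app); exact field names may differ between versions, and the model_source repo id is an assumption based on this thread. Note that it only exposes the OpenAI-compatible routes, with no hook for custom multimodal preprocessing:

# Sketch of the standard Ray Serve LLM deployment, for contrast (assumes the
# ray.serve.llm API of recent Ray releases; repo id is an assumption).
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="cosmos-reason1-7b",             # name clients will use
        model_source="nvidia/Cosmos-Reason1-7B",  # assumed Hugging Face repo
    ),
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=1, max_replicas=1),
    ),
    engine_kwargs=dict(
        tensor_parallel_size=1,
        max_model_len=4096,
        limit_mm_per_prompt={"image": 10, "video": 10},
    ),
)

# This exposes only OpenAI-compatible routes such as /v1/chat/completions;
# there is no place to run process_vision_info or call llm.generate directly.
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app)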

To achieve the level of control you want, you must create a custom Ray Serve deployment that wraps your own vLLM engine instance (e.g., using AsyncLLMEngine), exposes your own FastAPI endpoints, and handles all preprocessing and direct calls to llm.generate as you wish. This is the approach shown in the Ray vLLM Serve example, where you define a deployment class, instantiate the engine, and implement your own HTTP handlers. This way, you can fully control the request format, preprocessing, and the call to llm.generate, just as you do outside of Ray. The OpenAI-compatible APIs in Ray Serve LLM are not designed for this use case and cannot be easily overloaded for arbitrary multimodal input logic.
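
Here is a minimal sketch of that approach, assuming the qwen_vl_utils preprocessing from the model card, vLLM's AsyncLLMEngine, and Ray Serve's FastAPI ingress. The /generate route, the request schema, and the nvidia/Cosmos-Reason1-7B repo id are illustrative assumptions, not part of any Ray API:

# Minimal sketch: custom Ray Serve deployment that owns the vLLM engine and
# does its own multimodal preprocessing before calling generate directly.
from fastapi import FastAPI
from pydantic import BaseModel
from qwen_vl_utils import process_vision_info
from ray import serve
from transformers import AutoProcessor
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams
from vllm.utils import random_uuid

app = FastAPI()


class GenerateRequest(BaseModel):
    messages: list          # chat-style messages, including image/video entries
    max_tokens: int = 512


@serve.deployment(ray_actor_options={"num_gpus": 1})
@serve.ingress(app)
class MultimodalVLMDeployment:
    def __init__(self, model_path: str):
        # Same engine setup as in the non-Ray code above.
        self.engine = AsyncLLMEngine.from_engine_args(
            AsyncEngineArgs(
                model=model_path,
                limit_mm_per_prompt={"image": 10, "video": 10},
                gpu_memory_utilization=0.45,
                max_model_len=4096,
            )
        )
        # Used only to build the text prompt from the chat messages.
        self.processor = AutoProcessor.from_pretrained(model_path)

    @app.post("/generate")
    async def generate(self, req: GenerateRequest):
        # Custom preprocessing, exactly as in the Hugging Face example.
        prompt = self.processor.apply_chat_template(
            req.messages, tokenize=False, add_generation_prompt=True
        )
        image_inputs, video_inputs, video_kwargs = process_vision_info(
            req.messages, return_video_kwargs=True
        )
        mm_data = {}
        if image_inputs is not None:
            mm_data["image"] = image_inputs
        if video_inputs is not None:
            mm_data["video"] = video_inputs

        llm_inputs = {
            "prompt": prompt,
            "multi_modal_data": mm_data,
            "mm_processor_kwargs": video_kwargs,
        }

        # Direct call to the engine; no OpenAI-compatible layer in between.
        final_output = None
        async for output in self.engine.generate(
            llm_inputs,
            SamplingParams(max_tokens=req.max_tokens),
            request_id=random_uuid(),
        ):
            final_output = output
        return {"text": final_output.outputs[0].text}


# Assumed Hugging Face repo id for the model discussed in this thread.
vlm_app = MultimodalVLMDeployment.bind("nvidia/Cosmos-Reason1-7B")
serve.run(vlm_app)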

Would you like a step-by-step example of how to build such a custom Ray Serve deployment for Cosmos-Reason1-7B?

Hint: Mention @RunLLM in the post for followups.

Alright, thank you LLM, that helped.

A point for the Ray devs:
It was not clear to me from the documentation how to achieve this.

My earlier attempts with vLLM in Ray ran into issues; in particular, Ray’s multiprocessing and vLLM’s seemed to collide.

I think it would be useful to link the example that the LLM provided somewhere more prominently in the docs.

At least for me, while the convenient deployment definition for Hugging Face models is great, “just” hosting a model from Hugging Face without any control over preprocessing is not very useful.

Thanks!

Thank you for the feedback, @manuel.ramblr! I’ll pass it along to our docs team and hopefully we can get that patched up soon :slight_smile: