Offline inference with vLLM: map_batches vs build_llm_processor

Hey community,

I am doing offline inference with Ray, and I see two approaches when it comes to vLLM. The first approach, described here (End-to-end: Offline Batch Inference — Ray 2.52.1), uses a sequence of map_batches calls. I am following this approach in my current implementation, where I have a class that initializes vLLM and calls its generate method on the dataset.

But I also see the Ray-supported way of doing offline inference, described here (Working with LLMs — Ray 2.52.1), where Ray provides build_llm_processor to set up all the configs.

Which is the better approach when it comes to LLMs? Note: I am using a custom datasource implementation.

The recommended approach for LLM offline inference in Ray is to use the build_llm_processor API from ray.data.llm, rather than manually initializing vLLM in a custom map_batches class. The build_llm_processor method provides a higher-level, production-ready interface that handles batching, resource orchestration, error handling, and integration with vLLM out of the box. It also supports advanced features like model parallelism, LoRA, embeddings, and vision-language models, and is designed for scalability and reliability in distributed settings. Manual map_batches with custom vLLM logic is more error-prone and lacks these optimizations and features, making it less suitable for large-scale or production workloads.

For custom data sources, you can still use build_llm_processor by applying it to your Ray Dataset after ingesting data from your source. This approach is documented and supported for both standard and custom pipelines. See Ray Data LLM documentation, Anyscale LLM batch inference guide, and example code for details.
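
For example, a minimal sketch of that pipeline shape, using ray.data.from_items as a stand-in for your custom datasource and the same illustrative model used in the examples below:

import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

# Stand-in for your custom datasource; in practice you would build `ds` with
# ray.data.read_datasource(YourCustomDatasource(), ...) or your existing ingestion code.
ds = ray.data.from_items([{"prompt": "What is machine learning?"}])

config = vLLMEngineProcessorConfig(
    model_source="unsloth/Llama-3.1-8B-Instruct",
    batch_size=32,
    concurrency=1,
)

processor = build_llm_processor(
    config,
    preprocess=lambda row: {
        "messages": [{"role": "user", "content": row["prompt"]}],
        "sampling_params": {"temperature": 0.7, "max_tokens": 100},
    },
    postprocess=lambda row: {"prompt": row["prompt"], "generated_text": row["generated_text"]},
)

result_ds = processor(ds)  # new Dataset with the postprocess columns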

Would you like a more detailed, step-by-step example of using build_llm_processor with a custom data source?

@RunLLM I see the processor takes a Dataset and returns a result dataset. I have a couple of questions about that.

  1. Does it modify the original dataset? If yes, how? Does it add a new column to, let's say, a pandas DataFrame of the original dataset?
  2. I would also want to access other properties of the vLLM result, such as logprobs, along with the text. Is that possible?

The processor does not modify the original dataset in-place; it returns a new Ray Dataset with additional columns (such as generated_text or any fields you specify in your postprocess function). Your original data remains unchanged, and you can control which columns are included in the output by customizing the postprocess lambda. If you convert the result to a pandas DataFrame, the new columns will appear alongside the original ones, but the original DataFrame is not altered unless you explicitly overwrite it (Ray Data LLM docs, Anyscale LLM batch inference guide).

As of Ray 2.50, you can access additional vLLM result properties like logprobs by including them in your postprocess function, provided you set logprobs in the sampling_params during preprocessing. The logprobs field is now exposed in the output row, so you can extract it just like generated_text (Ray GitHub issue #58894, Ray GitHub issue #57586).

Would you like a code example showing how to access logprobs and other fields in the result?

@RunLLM yes, let's go through a code example for accessing logprobs and text.

Certainly! To access both generated_text and logprobs from vLLM in Ray Data, set logprobs in your sampling_params and extract both fields in your postprocess function. Here’s a concise example:

import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

config = vLLMEngineProcessorConfig(
    model_source="unsloth/Llama-3.1-8B-Instruct",
    concurrency=1,
    batch_size=32,
    engine_kwargs={"max_model_len": 4096},
)

processor = build_llm_processor(
    config,
    preprocess=lambda row: {
        "messages": [{"role": "user", "content": row["prompt"]}],
        "sampling_params": {"temperature": 0.7, "max_tokens": 100, "logprobs": True},
    },
    postprocess=lambda row: {
        "prompt": row["prompt"],
        "generated_text": row["generated_text"],
        "logprobs": row["logprobs"],  # logprobs now available in output
    },
)

ds = ray.data.from_items([{"prompt": "What is machine learning?"}])
result_ds = processor(ds)
for result in result_ds.iter_rows():
    print("Prompt:", result["prompt"])
    print("Text:", result["generated_text"])
    print("Logprobs:", result["logprobs"])

This will print both the generated text and the logprobs for each row. The logprobs field is included in the output row when requested in sampling_params (Ray Data LLM logprobs support).

@RunLLM what does preprocess mean here? I thought SamplingParams would be in vLLMEngineProcessorConfig. My understanding is that you add pre- and post-processing to apply your custom transformations to the data. Instead we are passing messages and sampling_params?

In Ray Data LLM, the preprocess function is used to transform each input row into the format expected by the LLM engine (such as vLLM). This includes constructing the messages (in OpenAI chat format) and specifying sampling_params for each request. While global engine settings (like model, batch size, etc.) are set in vLLMEngineProcessorConfig, per-request generation parameters (e.g., temperature, max_tokens, logprobs) must be provided in the sampling_params dictionary within the preprocess function. This allows you to customize inference behavior for each row (Ray Data LLM docs, Anyscale LLM batch inference guide).

So, preprocess is not just for custom data transformations—it is required to supply the LLM with the correct input structure and per-row parameters. The engine config sets global/static options, while preprocess provides dynamic, row-level inference instructions.
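
To make the two layers concrete, here is a minimal sketch (the task column and its values are hypothetical) where the engine-level settings stay in vLLMEngineProcessorConfig while sampling_params is computed per row in preprocess:

import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

# Static, engine-level settings live in the config.
config = vLLMEngineProcessorConfig(
    model_source="unsloth/Llama-3.1-8B-Instruct",
    batch_size=32,
    concurrency=1,
)

# Dynamic, per-row generation settings live in preprocess.
processor = build_llm_processor(
    config,
    preprocess=lambda row: {
        "messages": [{"role": "user", "content": row["prompt"]}],
        "sampling_params": {
            # Hypothetical per-row logic: lower temperature for extraction-style rows.
            "temperature": 0.2 if row.get("task") == "extraction" else 0.8,
            "max_tokens": 100,
        },
    },
    postprocess=lambda row: {"prompt": row["prompt"], "generated_text": row["generated_text"]},
)

ds = ray.data.from_items([
    {"prompt": "Extract the dates from: 'Meet on Jan 5.'", "task": "extraction"},
    {"prompt": "Write a haiku about Ray.", "task": "creative"},
])
result_ds = processor(ds)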

Would you like more detail or a breakdown of how these two layers interact?

@RunLLM thank you. And what postprocessing does is access the result and generate a dictionary for each row, right? Can you instead show me an example where I create a JSON of generated_text and logprobs in postprocess and then attach this JSON to each row of the original dataset as a new column named result?

It seems likely that you can use the postprocess function to create a JSON object containing both generated_text and logprobs, and then attach this as a new column (e.g., result) in the output dataset. The postprocess lambda receives each row (including the LLM outputs), so you can construct and serialize your desired JSON structure there (Ray Data LLM docs).

Here’s an example:

import ray
import json
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

config = vLLMEngineProcessorConfig(
    model_source="unsloth/Llama-3.1-8B-Instruct",
    concurrency=1,
    batch_size=32,
    engine_kwargs={"max_model_len": 4096},
)

processor = build_llm_processor(
    config,
    preprocess=lambda row: {
        "messages": [{"role": "user", "content": row["prompt"]}],
        "sampling_params": {"temperature": 0.7, "max_tokens": 100, "logprobs": True},
    },
    postprocess=lambda row: {
        **row,  # keep the upstream columns (prompt, generated_text, logprobs, ...) alongside the new one
        "result": json.dumps({
            "generated_text": row["generated_text"],
            "logprobs": row["logprobs"],
        }),
    },
)

ds = ray.data.from_items([{"prompt": "What is machine learning?"}])
result_ds = processor(ds)
for result in result_ds.iter_rows():
    print(result["result"])  # This is your JSON string

# The 'result' column will be present in the output dataset, containing the JSON for each row.

This approach does not modify your original dataset in-place, but the output dataset will have a new result column as requested.

@RunLLM on the result dataset, can I use write_datasink? I have a custom datasink implementation.

Yes, you can use write_datasink on the result dataset, including with your custom datasink implementation. After processing with the LLM processor, simply call result_ds.write_datasink(your_custom_datasink). Ray Data supports writing to custom datasinks by subclassing ray.data.Datasink and passing your instance to write_datasink (Ray Data docs, Ray Data source code).

This allows you to write the processed dataset (with your new result column) to any destination or format you define in your custom datasink.
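
For reference, a minimal sketch of that wiring, assuming the Datasink.write(blocks, ctx) interface from recent Ray releases (the exact signature and hooks vary slightly across versions, so check the Datasink API reference for yours):

from typing import Any, Iterable

from ray.data import Datasink
from ray.data.block import Block, BlockAccessor


class PrintDatasink(Datasink):
    """Toy datasink that just prints each block; replace the body with your own write logic."""

    def write(self, blocks: Iterable[Block], ctx) -> Any:
        for block in blocks:
            df = BlockAccessor.for_block(block).to_pandas()
            # Your custom sink would push `df` (including the new 'result' column)
            # to its destination here.
            print(df[["result"]])
        return "ok"


# result_ds is the dataset returned by the LLM processor above:
# result_ds.write_datasink(PrintDatasink())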

@RunLLM what version of vLLM is supported in this approach to offline inference? Can I expect continuous vLLM upgrades?

Ray Data LLM batch inference officially supports vLLM versions >= 0.7.2, with ongoing upgrades to support newer releases (e.g., 0.10.2, 0.12.0). The Ray team regularly updates compatibility, but breaking changes in vLLM may require Ray upgrades as well. You should check the Ray release notes and Ray Data LLM documentation for the latest supported vLLM version.

Continuous vLLM upgrades are expected, but always verify compatibility before upgrading, as some vLLM changes may temporarily break Ray integration until patched (Ray PR #58026, Ray PR #56535).
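
A quick way to confirm which versions your environment is actually running before and after an upgrade:

import ray
import vllm

# Print the installed versions so you can check them against the Ray release notes.
print("ray:", ray.__version__)
print("vllm:", vllm.__version__)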

@RunLLM for vision-language models, does this API encapsulate model-family-specific template processing?

In my current implementation I am using Qwen2.5, and in the map_batches approach this is how I build the request. process_vision_info comes from qwen-vl-utils and is required to build the vLLM request:

messages = []
messages.append({"role": "system", "content": "Some System Prompt"})

user_content = []
user_content.append({"type": "image", "image": pil_image})
user_content.append({"type": "text", "text": text_prompt})

messages.append({"role": "user", "content": user_content})

prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

mm_data = {}
image_inputs, _, _ = process_vision_info(messages, return_video_kwargs=True)  # from qwen-vl-utils
mm_data["image"] = image_inputs

llm_input = {"prompt": prompt}
llm_input["multi_modal_data"] = mm_data

Yes, the Ray Data LLM API (using build_llm_processor with vLLMEngineProcessorConfig) encapsulates model family–specific template processing for vision-language models like Qwen2.5. You do not need to manually call apply_chat_template or process_vision_info—the processor handles chat template formatting and multimodal data extraction internally, as long as you provide the messages field in OpenAI chat format with image/text content (Ray Data LLM docs, Ray VLM example).

For Qwen2.5, simply structure your messages as shown in the Ray VLM example, and set has_image=True in the config. The processor will apply the correct template and multimodal processing for the model family. You do not need to use process_vision_info or build the vLLM request manually.

Would you like a code example for Qwen2.5 with Ray Data LLM?

@RunLLM yes, let's go through an example. Say in the dataset I have col1, which I want to use as the text prompt, and col2, which has image bytes. I have a static text for the system prompt, like “Some System Prompt”.

Here’s a Ray Data LLM example for Qwen2.5 VLM, where col1 is your text prompt and col2 contains image bytes. The processor will handle chat template and multimodal formatting automatically:

import ray
from PIL import Image
from io import BytesIO
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

config = vLLMEngineProcessorConfig(
    model_source="Qwen/Qwen2.5-VL-3B-Instruct",
    has_image=True,
    batch_size=16,
    concurrency=1,
    engine_kwargs={"max_model_len": 4096},
)

def vision_preprocess(row):
    return {
        "messages": [
            {"role": "system", "content": "Some System Prompt"},
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": Image.open(BytesIO(row["col2"]))},
                    {"type": "text", "text": row["col1"]},
                ],
            },
        ],
        "sampling_params": {"temperature": 0.3, "max_tokens": 150},
    }

def vision_postprocess(row):
    return {"result": row["generated_text"], "col1": row["col1"]}

ds = ray.data.from_items([
    {"col1": "Describe this image.", "col2": open("your_image.jpg", "rb").read()},
    # ... more rows ...
])

processor = build_llm_processor(
    config,
    preprocess=vision_preprocess,
    postprocess=vision_postprocess,
)

result_ds = processor(ds)
for row in result_ds.iter_rows():
    print(row["col1"], "->", row["result"])

This will use your text and image columns, apply the system prompt, and output the generated result. No manual template or multimodal processing is needed (Ray VLM example).

@RunLLM and what about llm.embed, which gives vLLM-generated embeddings? Is that supported too?

Yes, embedding generation with vLLM is supported in Ray Data and Ray Serve. For Ray Data, set task_type="embed" in vLLMEngineProcessorConfig and use the processor as usual; for Ray Serve, set engine_kwargs={"task": "embed"} in your LLM config and use the /v1/embeddings endpoint (Ray Data LLM docs, Ray Serve LLM vLLM compatibility).
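
As a rough sketch of the Ray Data path, assuming the task_type="embed" flag mentioned above, an illustrative embedding model, and that the vectors land in an embeddings output column (check the Ray Data LLM docs for the exact field names in your version):

import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

config = vLLMEngineProcessorConfig(
    model_source="sentence-transformers/all-MiniLM-L6-v2",  # illustrative; any vLLM-supported embedding model
    task_type="embed",
    batch_size=32,
    concurrency=1,
)

processor = build_llm_processor(
    config,
    preprocess=lambda row: {"prompt": row["text"]},
    # Assumes the engine writes the vectors to an "embeddings" column.
    postprocess=lambda row: {"text": row["text"], "embeddings": row["embeddings"]},
)

ds = ray.data.from_items([{"text": "What is machine learning?"}])
result_ds = processor(ds)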

Would you like a more complete, end-to-end code example for embedding generation?
