Offline inference with vLLM: map_batches vs build_llm_processor

Hey community,

I am doing offline inference with Ray, and I see two approaches when it comes to vLLM. The first approach, described here (End-to-end: Offline Batch Inference — Ray 2.52.1), uses a sequence of map_batches calls. I am following this approach in my current implementation, where I have a class that initializes vLLM and calls its generate method on the dataset.

But I also see the Ray-supported way of doing offline inference, described here (Working with LLMs — Ray 2.52.1), where Ray provides build_llm_processor to set up all the configs.

Which is the better approach when it comes to LLMs? Note: I am using a custom datasource implementation.

The recommended approach for LLM offline inference in Ray is to use the build_llm_processor API from ray.data.llm, rather than manually initializing vLLM in a custom map_batches class. The build_llm_processor method provides a higher-level, production-ready interface that handles batching, resource orchestration, error handling, and integration with vLLM out of the box. It also supports advanced features like model parallelism, LoRA, embeddings, and vision-language models, and is designed for scalability and reliability in distributed settings. Manual map_batches with custom vLLM logic is more error-prone and lacks these optimizations and features, making it less suitable for large-scale or production workloads.

For custom data sources, you can still use build_llm_processor by applying it to your Ray Dataset after ingesting data from your source. This approach is documented and supported for both standard and custom pipelines. See Ray Data LLM documentation, Anyscale LLM batch inference guide, and example code for details.
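
For example, a minimal sketch of that pipeline shape, using ray.data.from_items as a stand-in for your custom datasource and the same illustrative model used in the examples below:

import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

# Stand-in for your custom datasource; in practice you would build `ds` with
# ray.data.read_datasource(YourCustomDatasource(), ...) or your existing ingestion code.
ds = ray.data.from_items([{"prompt": "What is machine learning?"}])

config = vLLMEngineProcessorConfig(
    model_source="unsloth/Llama-3.1-8B-Instruct",
    batch_size=32,
    concurrency=1,
)

processor = build_llm_processor(
    config,
    preprocess=lambda row: {
        "messages": [{"role": "user", "content": row["prompt"]}],
        "sampling_params": {"temperature": 0.7, "max_tokens": 100},
    },
    postprocess=lambda row: {"prompt": row["prompt"], "generated_text": row["generated_text"]},
)

result_ds = processor(ds)  # new Dataset with the postprocess columns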

Would you like a more detailed, step-by-step example of using build_llm_processor with a custom data source?

@RunLLM I see the processor takes a Dataset and returns a result dataset. I have a couple of questions about that.

  1. Does it modify the original dataset? If yes, how? Does it add a new column to, let's say, a pandas DataFrame of the original dataset?
  2. I would also want to access other properties of the vLLM result, such as logprobs, along with the text. Is that possible?

The processor does not modify the original dataset in-place; it returns a new Ray Dataset with additional columns (such as generated_text or any fields you specify in your postprocess function). Your original data remains unchanged, and you can control which columns are included in the output by customizing the postprocess lambda. If you convert the result to a pandas DataFrame, the new columns will appear alongside the original ones, but the original DataFrame is not altered unless you explicitly overwrite it (Ray Data LLM docs, Anyscale LLM batch inference guide).

As of Ray 2.50, you can access additional vLLM result properties like logprobs by including them in your postprocess function, provided you set logprobs in the sampling_params during preprocessing. The logprobs field is now exposed in the output row, so you can extract it just like generated_text (Ray GitHub issue #58894, Ray GitHub issue #57586).

Would you like a code example showing how to access logprobs and other fields in the result?

@RunLLM yes, let's go through a code example for accessing logprobs and text.

Certainly! To access both generated_text and logprobs from vLLM in Ray Data, set logprobs in your sampling_params and extract both fields in your postprocess function. Here’s a concise example:

import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

config = vLLMEngineProcessorConfig(
    model_source="unsloth/Llama-3.1-8B-Instruct",
    concurrency=1,
    batch_size=32,
    engine_kwargs={"max_model_len": 4096},
)

processor = build_llm_processor(
    config,
    preprocess=lambda row: {
        "messages": [{"role": "user", "content": row["prompt"]}],
        "sampling_params": {"temperature": 0.7, "max_tokens": 100, "logprobs": True},
    },
    postprocess=lambda row: {
        "prompt": row["prompt"],
        "generated_text": row["generated_text"],
        "logprobs": row["logprobs"],  # logprobs now available in output
    },
)

ds = ray.data.from_items([{"prompt": "What is machine learning?"}])
result_ds = processor(ds)
for result in result_ds.iter_rows():
    print("Prompt:", result["prompt"])
    print("Text:", result["generated_text"])
    print("Logprobs:", result["logprobs"])

This will print both the generated text and the logprobs for each row. The logprobs field is included in the output row when requested in sampling_params (Ray Data LLM logprobs support).

@RunLLM what does preprocess mean here? I thought SamplingParams would be in vLLMEngineProcessorConfig. My understanding is that you add pre- and post-processing to apply your custom transformations to the data. Instead we are passing messages and sampling_params?

In Ray Data LLM, the preprocess function is used to transform each input row into the format expected by the LLM engine (such as vLLM). This includes constructing the messages (in OpenAI chat format) and specifying sampling_params for each request. While global engine settings (like model, batch size, etc.) are set in vLLMEngineProcessorConfig, per-request generation parameters (e.g., temperature, max_tokens, logprobs) must be provided in the sampling_params dictionary within the preprocess function. This allows you to customize inference behavior for each row (Ray Data LLM docs, Anyscale LLM batch inference guide).

So, preprocess is not just for custom data transformations—it is required to supply the LLM with the correct input structure and per-row parameters. The engine config sets global/static options, while preprocess provides dynamic, row-level inference instructions.
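
To make the two layers concrete, here is a minimal sketch (the task column and its values are hypothetical) where the engine-level settings stay in vLLMEngineProcessorConfig while sampling_params is computed per row in preprocess:

import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

# Static, engine-level settings live in the config.
config = vLLMEngineProcessorConfig(
    model_source="unsloth/Llama-3.1-8B-Instruct",
    batch_size=32,
    concurrency=1,
)

# Dynamic, per-row generation settings live in preprocess.
processor = build_llm_processor(
    config,
    preprocess=lambda row: {
        "messages": [{"role": "user", "content": row["prompt"]}],
        "sampling_params": {
            # Hypothetical per-row logic: lower temperature for extraction-style rows.
            "temperature": 0.2 if row.get("task") == "extraction" else 0.8,
            "max_tokens": 100,
        },
    },
    postprocess=lambda row: {"prompt": row["prompt"], "generated_text": row["generated_text"]},
)

ds = ray.data.from_items([
    {"prompt": "Extract the dates from: 'Meet on Jan 5.'", "task": "extraction"},
    {"prompt": "Write a haiku about Ray.", "task": "creative"},
])
result_ds = processor(ds)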

Would you like more detail or a breakdown of how these two layers interact?

@RunLLM thank you. And what postprocessing does is access the result and generate a dictionary for each row, right? Can you instead show me an example where I create a JSON of generated_text and logprobs in postprocess and then attach this JSON to each row of the original dataset as a new column named result?

It seems likely that you can use the postprocess function to create a JSON object containing both generated_text and logprobs, and then attach this as a new column (e.g., result) in the output dataset. The postprocess lambda receives each row (including the LLM outputs), so you can construct and serialize your desired JSON structure there (Ray Data LLM docs).

Here’s an example:

import ray
import json
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

config = vLLMEngineProcessorConfig(
    model_source="unsloth/Llama-3.1-8B-Instruct",
    concurrency=1,
    batch_size=32,
    engine_kwargs={"max_model_len": 4096},
)

processor = build_llm_processor(
    config,
    preprocess=lambda row: {
        "messages": [{"role": "user", "content": row["prompt"]}],
        "sampling_params": {"temperature": 0.7, "max_tokens": 100, "logprobs": True},
    },
    postprocess=lambda row: {
        **row,  # keep the upstream columns (prompt, generated_text, logprobs, ...) alongside the new one
        "result": json.dumps({
            "generated_text": row["generated_text"],
            "logprobs": row["logprobs"],
        }),
    },
)

ds = ray.data.from_items([{"prompt": "What is machine learning?"}])
result_ds = processor(ds)
for result in result_ds.iter_rows():
    print(result["result"])  # This is your JSON string

# The 'result' column will be present in the output dataset, containing the JSON for each row.

This approach does not modify your original dataset in-place, but the output dataset will have a new result column as requested.

@RunLLM on the result dataset, can I use write_datasink? I have a custom datasink implementation.

Yes, you can use write_datasink on the result dataset, including with your custom datasink implementation. After processing with the LLM processor, simply call result_ds.write_datasink(your_custom_datasink). Ray Data supports writing to custom datasinks by subclassing ray.data.Datasink and passing your instance to write_datasink (Ray Data docs, Ray Data source code).

This allows you to write the processed dataset (with your new result column) to any destination or format you define in your custom datasink.
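
For reference, a minimal sketch of that wiring, assuming the Datasink.write(blocks, ctx) interface from recent Ray releases (the exact signature and hooks vary slightly across versions, so check the Datasink API reference for yours):

from typing import Any, Iterable

from ray.data import Datasink
from ray.data.block import Block, BlockAccessor


class PrintDatasink(Datasink):
    """Toy datasink that just prints each block; replace the body with your own write logic."""

    def write(self, blocks: Iterable[Block], ctx) -> Any:
        for block in blocks:
            df = BlockAccessor.for_block(block).to_pandas()
            # Your custom sink would push `df` (including the new 'result' column)
            # to its destination here.
            print(df[["result"]])
        return "ok"


# result_ds is the dataset returned by the LLM processor above:
# result_ds.write_datasink(PrintDatasink())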

@RunLLM what version of vLLM is supported in this approach to offline inference? Can I expect continuous vLLM upgrades?

Ray Data LLM batch inference officially supports vLLM versions >= 0.7.2, with ongoing upgrades to support newer releases (e.g., 0.10.2, 0.12.0). The Ray team regularly updates compatibility, but breaking changes in vLLM may require Ray upgrades as well. You should check the Ray release notes and Ray Data LLM documentation for the latest supported vLLM version.

Continuous vLLM upgrades are expected, but always verify compatibility before upgrading, as some vLLM changes may temporarily break Ray integration until patched (Ray PR #58026, Ray PR #56535).
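
A quick way to confirm which versions your environment is actually running before and after an upgrade:

import ray
import vllm

# Print the installed versions so you can check them against the Ray release notes.
print("ray:", ray.__version__)
print("vllm:", vllm.__version__)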

@RunLLM for vision-language models, does this API encapsulate model-family-specific template processing?

In my current implementation I am using Qwen2.5, and in the map_batches approach this is how I build the request. process_vision_info comes from qwen-vl-utils and is required to build the vLLM request:

messages = []
messages.append({"role": "system", "content": "Some System Prompt"})

user_content = []
user_content.append({"type": "image", "image": pil_image})
user_content.append({"type": "text", "text": text_prompt})

messages.append({"role": "user", "content": user_content})

prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

mm_data = {}
image_inputs, _, _ = process_vision_info(messages, return_video_kwargs=True)  # from qwen-vl-utils
mm_data["image"] = image_inputs

llm_input = {"prompt": prompt}
llm_input["multi_modal_data"] = mm_data

Yes, the Ray Data LLM API (using build_llm_processor with vLLMEngineProcessorConfig) encapsulates model family–specific template processing for vision-language models like Qwen2.5. You do not need to manually call apply_chat_template or process_vision_info—the processor handles chat template formatting and multimodal data extraction internally, as long as you provide the messages field in OpenAI chat format with image/text content (Ray Data LLM docs, Ray VLM example).

For Qwen2.5, simply structure your messages as shown in the Ray VLM example, and set has_image=True in the config. The processor will apply the correct template and multimodal processing for the model family. You do not need to use process_vision_info or build the vLLM request manually.

Would you like a code example for Qwen2.5 with Ray Data LLM?

@RunLLM yes, let's go through an example. Say in the dataset I have col1, which I want to use as the text prompt, and col2, which has image bytes. I have a static text for the system prompt, like “Some System Prompt”.

Here’s a Ray Data LLM example for Qwen2.5 VLM, where col1 is your text prompt and col2 contains image bytes. The processor will handle chat template and multimodal formatting automatically:

import ray
from PIL import Image
from io import BytesIO
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

config = vLLMEngineProcessorConfig(
    model_source="Qwen/Qwen2.5-VL-3B-Instruct",
    has_image=True,
    batch_size=16,
    concurrency=1,
    engine_kwargs={"max_model_len": 4096},
)

def vision_preprocess(row):
    return {
        "messages": [
            {"role": "system", "content": "Some System Prompt"},
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": Image.open(BytesIO(row["col2"]))},
                    {"type": "text", "text": row["col1"]},
                ],
            },
        ],
        "sampling_params": {"temperature": 0.3, "max_tokens": 150},
    }

def vision_postprocess(row):
    return {"result": row["generated_text"], "col1": row["col1"]}

ds = ray.data.from_items([
    {"col1": "Describe this image.", "col2": open("your_image.jpg", "rb").read()},
    # ... more rows ...
])

processor = build_llm_processor(
    config,
    preprocess=vision_preprocess,
    postprocess=vision_postprocess,
)

result_ds = processor(ds)
for row in result_ds.iter_rows():
    print(row["col1"], "->", row["result"])

This will use your text and image columns, apply the system prompt, and output the generated result. No manual template or multimodal processing is needed (Ray VLM example).

@RunLLM and what about llm.embed, which gives vLLM-generated embeddings? Is that supported too?

Yes, embedding generation with vLLM is supported in Ray Data and Ray Serve. For Ray Data, set task_type="embed" in vLLMEngineProcessorConfig and use the processor as usual; for Ray Serve, set engine_kwargs={"task": "embed"} in your LLM config and use the /v1/embeddings endpoint (Ray Data LLM docs, Ray Serve LLM vLLM compatibility).
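
As a rough sketch of the Ray Data path, assuming the task_type="embed" flag mentioned above, an illustrative embedding model, and that the vectors land in an embeddings output column (check the Ray Data LLM docs for the exact field names in your version):

import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

config = vLLMEngineProcessorConfig(
    model_source="sentence-transformers/all-MiniLM-L6-v2",  # illustrative; any vLLM-supported embedding model
    task_type="embed",
    batch_size=32,
    concurrency=1,
)

processor = build_llm_processor(
    config,
    preprocess=lambda row: {"prompt": row["text"]},
    # Assumes the engine writes the vectors to an "embeddings" column.
    postprocess=lambda row: {"text": row["text"], "embeddings": row["embeddings"]},
)

ds = ray.data.from_items([{"text": "What is machine learning?"}])
result_ds = processor(ds)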

Would you like a more complete, end-to-end code example for embedding generation?
