Can multiple Ray Data pipeline steps share the same large model instance for inference?

Hi,

In a Ray Data processing pipeline, I have multiple steps that all need to call the same large model for inference. Although their preprocessing and postprocessing may differ, the inference stage itself is exactly the same across these steps.

Is it possible for these steps to share the same model instance (e.g., the same GPU-backed actor or engine) to avoid loading multiple copies of the model into GPU memory? Right now each step would start its own instance, and we don't have enough GPUs to run them all.

Has anyone managed to reuse the same model instance across multiple pipeline steps in Ray Data, or is there a recommended pattern for this scenario?

Thanks in advance for any advice.

import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

config = vLLMEngineProcessorConfig(
    model_source="unsloth/Llama-3.1-8B-Instruct",
    engine_kwargs={
        "enable_chunked_prefill": True,
        "max_num_batched_tokens": 4096,
        "max_model_len": 16384,
    },
    concurrency=1,
    batch_size=64,
)
processor = build_llm_processor(
    config,
    preprocess=lambda row: dict(
        messages=[
            {"role": "system", "content": "You are a bot that responds with haikus."},
            {"role": "user", "content": row["item"]}
        ],
        sampling_params=dict(
            temperature=0.3,
            max_tokens=250,
        )
    ),
    postprocess=lambda row: dict(
        answer=row["generated_text"],
        **row  
    ),
)

ds = ray.data.from_items(["Start of the haiku is: Complete this for me..."])

ds = processor(ds)
# processor1, processor2, processor3 are built with build_llm_processor in the same
# way (different pre/postprocessing, same model); today each one starts its own
# vLLM engine and loads another copy of the model onto the GPUs.
ds = processor1(ds)
ds = processor2(ds)
ds = processor3(ds)
ds.show(limit=1)

Hi @donglin_hao,

The solution for this use case is to deploy the shared model via Ray Serve LLM and then use the HTTP processor from Ray Data LLM to call it. Will that work for you?
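
Roughly, the pattern looks like the sketch below. This is a minimal sketch, assuming a recent Ray release where ray.serve.llm and HttpRequestProcessorConfig are available; the model alias, endpoint URL, and qps value are placeholders to adapt to your setup:

import ray
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app
from ray.data.llm import HttpRequestProcessorConfig, build_llm_processor

# 1) Deploy the model once as an OpenAI-compatible Ray Serve app.
llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="llama-3.1-8b",  # alias used by clients (placeholder)
        model_source="unsloth/Llama-3.1-8B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=1, max_replicas=1),
    ),
    engine_kwargs=dict(max_model_len=16384),
)
serve.run(build_openai_app({"llm_configs": [llm_config]}))

# 2) Each pipeline step builds a cheap HTTP processor (no GPU) that hits the
#    single shared deployment; only the pre/postprocess differ per step.
def make_http_step(preprocess, postprocess):
    config = HttpRequestProcessorConfig(
        url="http://localhost:8000/v1/chat/completions",  # Serve endpoint (placeholder)
        qps=10,
    )
    return build_llm_processor(config, preprocess=preprocess, postprocess=postprocess)

step_haiku = make_http_step(
    preprocess=lambda row: dict(
        payload=dict(
            model="llama-3.1-8b",
            messages=[
                {"role": "system", "content": "You are a bot that responds with haikus."},
                {"role": "user", "content": row["item"]},
            ],
            temperature=0.3,
            max_tokens=250,
        ),
    ),
    postprocess=lambda row: dict(
        answer=row["http_response"]["choices"][0]["message"]["content"],
        **row,
    ),
)

ds = ray.data.from_items(["Start of the haiku is: Complete this for me..."])
ds = step_haiku(ds)  # other steps reuse the same endpoint with their own pre/postprocess
ds.show(limit=1)

Because every step is just an HTTP client, none of them hold GPU resources; only the Serve replicas do.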

Sorry for the late reply. Is the Ray Serve approach stable under high concurrency? It also seems less efficient than an actor pool that shares the model for inference.
Is there a plan for Ray Data to allow multiple steps to share an actor pool?

There is a thread on this here; I'll try to address it there. But to answer your question: for high concurrency it should, in principle, scale linearly with more replicas. There is some overhead that we are actively addressing, but otherwise there should be no meaningful difference compared to having Ray Data manage the engines itself.
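
For reference, scaling the shared deployment for higher concurrency is mostly a matter of raising the replica bounds in the Serve config. A sketch with illustrative numbers (each replica holds its own copy of the engine, so you still need enough GPUs to back max_replicas):

from ray.serve.llm import LLMConfig

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="llama-3.1-8b",
        model_source="unsloth/Llama-3.1-8B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1,
            max_replicas=4,              # scale out under high concurrency
            target_ongoing_requests=32,  # per-replica autoscaling target (illustrative)
        ),
    ),
)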