Can multiple Ray Data pipeline steps share the same large model instance for inference?

Hi,

In a Ray Data processing pipeline, I have multiple steps that all need to call the same large model for inference. Although their preprocessing and postprocessing may differ, the inference stage itself is exactly the same across these steps.

Is it possible for these steps to share the same model instance (e.g., the same GPU-backed actor or engine) to avoid loading multiple copies of the model into GPU memory? Right now each step would start its own instance, and we don't have enough GPUs to run them all.

Has anyone managed to reuse the same model instance across multiple pipeline steps in Ray Data, or is there a recommended pattern for this scenario?

Thanks in advance for any advice.

import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

config = vLLMEngineProcessorConfig(
    model_source="unsloth/Llama-3.1-8B-Instruct",
    engine_kwargs={
        "enable_chunked_prefill": True,
        "max_num_batched_tokens": 4096,
        "max_model_len": 16384,
    },
    concurrency=1,
    batch_size=64,
)
processor = build_llm_processor(
    config,
    preprocess=lambda row: dict(
        messages=[
            {"role": "system", "content": "You are a bot that responds with haikus."},
            {"role": "user", "content": row["item"]}
        ],
        sampling_params=dict(
            temperature=0.3,
            max_tokens=250,
        )
    ),
    postprocess=lambda row: dict(
        answer=row["generated_text"],
        **row  
    ),
)

ds = ray.data.from_items(["Start of the haiku is: Complete this for me..."])

ds = processor(ds)
# processor1, processor2, processor3 are built with build_llm_processor in the same
# way (different pre/postprocessing, same model); today each one starts its own
# vLLM engine and loads another copy of the model onto the GPUs.
ds = processor1(ds)
ds = processor2(ds)
ds = processor3(ds)
ds.show(limit=1)

Hi @donglin_hao,

The solution for this use case is to deploy the shared model via Ray Serve LLM and then use the HTTP processor from Ray Data LLM to call it. Will that work for you?
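
Roughly, the pattern looks like the sketch below. This is a minimal sketch, assuming a recent Ray release where ray.serve.llm and HttpRequestProcessorConfig are available; the model alias, endpoint URL, and qps value are placeholders to adapt to your setup:

import ray
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app
from ray.data.llm import HttpRequestProcessorConfig, build_llm_processor

# 1) Deploy the model once as an OpenAI-compatible Ray Serve app.
llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="llama-3.1-8b",  # alias used by clients (placeholder)
        model_source="unsloth/Llama-3.1-8B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=1, max_replicas=1),
    ),
    engine_kwargs=dict(max_model_len=16384),
)
serve.run(build_openai_app({"llm_configs": [llm_config]}))

# 2) Each pipeline step builds a cheap HTTP processor (no GPU) that hits the
#    single shared deployment; only the pre/postprocess differ per step.
def make_http_step(preprocess, postprocess):
    config = HttpRequestProcessorConfig(
        url="http://localhost:8000/v1/chat/completions",  # Serve endpoint (placeholder)
        qps=10,
    )
    return build_llm_processor(config, preprocess=preprocess, postprocess=postprocess)

step_haiku = make_http_step(
    preprocess=lambda row: dict(
        payload=dict(
            model="llama-3.1-8b",
            messages=[
                {"role": "system", "content": "You are a bot that responds with haikus."},
                {"role": "user", "content": row["item"]},
            ],
            temperature=0.3,
            max_tokens=250,
        ),
    ),
    postprocess=lambda row: dict(
        answer=row["http_response"]["choices"][0]["message"]["content"],
        **row,
    ),
)

ds = ray.data.from_items(["Start of the haiku is: Complete this for me..."])
ds = step_haiku(ds)  # other steps reuse the same endpoint with their own pre/postprocess
ds.show(limit=1)

Because every step is just an HTTP client, none of them hold GPU resources; only the Serve replicas do.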

Sorry for the late reply. Is the Ray Serve approach stable under high concurrency? It also seems less efficient than an actor pool that shares the model for inference.
Is there a plan for Ray Data to allow multiple steps to share an actor pool?

There is a thread on this here; I'll try to address it there. But to answer your question: for high concurrency it should, in principle, scale linearly with more replicas. There is some overhead that we are actively addressing, but otherwise there should be no meaningful difference compared to having Ray Data manage the engines itself.
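
For reference, scaling the shared deployment for higher concurrency is mostly a matter of raising the replica bounds in the Serve config. A sketch with illustrative numbers (each replica holds its own copy of the engine, so you still need enough GPUs to back max_replicas):

from ray.serve.llm import LLMConfig

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="llama-3.1-8b",
        model_source="unsloth/Llama-3.1-8B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1,
            max_replicas=4,              # scale out under high concurrency
            target_ongoing_requests=32,  # per-replica autoscaling target (illustrative)
        ),
    ),
)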