Hi,
In a Ray Data processing pipeline, I have multiple steps that all need to call the same large model for inference. Although their preprocessing and postprocessing may differ, the inference stage itself is exactly the same across these steps.
Is it possible for these steps to share the same model instance (e.g., the same GPU-backed actor or engine) to avoid loading multiple copies of the model into GPU memory? Right now, each step spins up its own engine replica, and we don't have enough GPUs to run them all.
Has anyone managed to reuse the same model instance across multiple pipeline steps in Ray Data, or is there a recommended pattern for this scenario? I've sketched the one workaround I've come up with below, after my current code.
Thanks in advance for any advice. Here is a simplified version of my current pipeline:
import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor
# One vLLM engine replica; every processor built from a config like this
# loads its own full copy of the model onto a GPU.
config = vLLMEngineProcessorConfig(
    model_source="unsloth/Llama-3.1-8B-Instruct",
    engine_kwargs={
        "enable_chunked_prefill": True,
        "max_num_batched_tokens": 4096,
        "max_model_len": 16384,
    },
    concurrency=1,  # number of engine replicas
    batch_size=64,
)
processor = build_llm_processor(
    config,
    # Per-step preprocessing: build the chat messages and sampling params.
    preprocess=lambda row: dict(
        messages=[
            {"role": "system", "content": "You are a bot that responds with haikus."},
            {"role": "user", "content": row["item"]},
        ],
        sampling_params=dict(
            temperature=0.3,
            max_tokens=250,
        ),
    ),
    # Per-step postprocessing: keep the generated text alongside the input row.
    postprocess=lambda row: dict(
        answer=row["generated_text"],
        **row,
    ),
)
ds = ray.data.from_items(["Start of the haiku is: Complete this for me..."])
ds = processor(ds)

# Today, each additional step would need its own processor (built from its
# own config), and each one loads another copy of the model:
# ds = processor1(ds)
# ds = processor2(ds)
# ds = processor3(ds)

ds.show(limit=1)
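For reference, the only workaround I've come up with so far is to take the model out of Ray Data entirely and host it in a single named actor that every step calls. This is just a sketch under my own assumptions: SharedLLM, step_a, and the prompt formatting are placeholders I made up, not part of the Ray Data LLM API, and it uses vllm.LLM directly instead of vLLMEngineProcessorConfig:

import ray
from vllm import LLM, SamplingParams

# Hypothetical shared actor: one GPU, one copy of the model.
@ray.remote(num_gpus=1)
class SharedLLM:
    def __init__(self, model_source):
        self.llm = LLM(model=model_source)

    def generate(self, prompts, **sampling_kwargs):
        params = SamplingParams(**sampling_kwargs)
        outputs = self.llm.generate(prompts, params)
        return [o.outputs[0].text for o in outputs]

# Named actor so every step resolves to the same handle (and model copy).
shared_llm = SharedLLM.options(name="shared_llm", get_if_exists=True).remote(
    "unsloth/Llama-3.1-8B-Instruct"
)

# Each step keeps its own pre/postprocessing but calls the shared actor.
def step_a(batch):
    prompts = [f"Complete this haiku: {item}" for item in batch["item"]]
    batch["answer"] = ray.get(
        shared_llm.generate.remote(prompts, temperature=0.3, max_tokens=250)
    )
    return batch

ds = ray.data.from_items(["Start of the haiku is: Complete this for me..."])
ds = ds.map_batches(step_a, batch_size=64)
ds.show(limit=1)

This does serialize every step through one engine, and I lose the batching and scheduling that build_llm_processor provides, which is why I'd much prefer a supported way to share one engine across multiple processors if one exists.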