@RunLLM let's talk about scaling. vLLM supports tensor parallelism, and I know Ray supports data parallelism. For tensor parallelism in my current workflow I take a direct approach by setting engine params like tensor_parallel_size and VLLM_WORKER_MULTIPROC_METHOD. What types of scaling are supported in this integration of Ray and vLLM?
Ray’s integration with vLLM supports both data parallelism (multiple independent model replicas) and model parallelism (tensor parallelism, pipeline parallelism, or both). You can configure tensor parallelism and pipeline parallelism by setting tensor_parallel_size and pipeline_parallel_size in engine_kwargs of your processor or LLM config. Ray will automatically orchestrate cross-node and multi-GPU scaling, including placement group strategies for optimal resource allocation (Ray Data LLM docs, Ray Serve LLM cross-node parallelism).
You do not need to manually set VLLM_WORKER_MULTIPROC_METHOD—Ray handles process orchestration and placement. Both data and model parallelism can be combined: for example, you can run multiple replicas (data parallelism), each using tensor and/or pipeline parallelism across multiple GPUs or nodes.
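For the serving side referenced above, here is a minimal sketch of combining replicas (data parallelism via autoscaling) with tensor parallelism, assuming the ray.serve.llm API available in recent Ray releases (2.44+); the model_id, replica bounds, and model choice are illustrative, not a drop-in config:

```python
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

# Each replica is an independent vLLM engine (data parallelism);
# tensor_parallel_size shards the model across 2 GPUs per replica.
llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="llama-3.1-8b",                       # illustrative served name
        model_source="unsloth/Llama-3.1-8B-Instruct",  # illustrative HF model
    ),
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=1, max_replicas=4),
    ),
    engine_kwargs=dict(tensor_parallel_size=2),
)

# Expose an OpenAI-compatible endpoint; Ray Serve places the replicas.
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app)
```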
Would you also like a Ray Data LLM code example for configuring both data and tensor parallelism for batch inference?
@RunLLM yes, a code example would be helpful. I am also curious about the scaling concerns this integration handles that users would normally have to deal with themselves in a manual map_batches approach. What are some out-of-the-box scaling benefits of this integration?
Here’s a code example for combining data parallelism (multiple replicas) and tensor/pipeline parallelism (multi-GPU per replica) in Ray Data LLM with vLLM:
```python
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

config = vLLMEngineProcessorConfig(
    model_source="unsloth/Llama-3.1-8B-Instruct",
    engine_kwargs={
        "tensor_parallel_size": 2,    # 2 GPUs per replica (tensor parallelism)
        "pipeline_parallel_size": 2,  # 2 pipeline stages (total 4 GPUs per replica)
        "max_model_len": 16384,
    },
    concurrency=3,  # 3 replicas (data parallelism)
    batch_size=32,
)

processor = build_llm_processor(
    config,
    preprocess=lambda row: {...},
    postprocess=lambda row: {...},
)

# Apply the processor to your dataset as usual
```
This will launch 3 independent model replicas, each using 4 GPUs (2 TP × 2 PP), for a total of 12 GPUs if available (Ray Data LLM docs, Ray example: basic_llm_example.py).
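To make the "apply to your dataset" step concrete, here is a minimal end-to-end sketch, assuming the messages / sampling_params / generated_text row conventions from the Ray Data LLM docs; the "prompt" column, sampling settings, and single-GPU engine settings are illustrative:

```python
import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

# Small single-GPU config so the sketch runs on one machine;
# swap in the multi-GPU engine_kwargs from above for real workloads.
config = vLLMEngineProcessorConfig(
    model_source="unsloth/Llama-3.1-8B-Instruct",
    engine_kwargs={"max_model_len": 8192},
    concurrency=1,
    batch_size=16,
)

processor = build_llm_processor(
    config,
    # Turn each input row into a chat request for the engine.
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["prompt"]}],
        sampling_params=dict(temperature=0.2, max_tokens=128),
    ),
    # Keep only the generated text in the output rows.
    postprocess=lambda row: dict(answer=row["generated_text"]),
)

ds = ray.data.from_items([{"prompt": "What is tensor parallelism?"}])
ds = processor(ds)    # lazily chains preprocess -> vLLM inference -> postprocess
print(ds.take_all())  # triggers execution
```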
Out-of-the-box scaling benefits in Ray Data LLM vs. manual map_batches:
- Automatic resource orchestration: Ray handles GPU/CPU allocation, placement groups, and parallelism, so you don’t need to manage process spawning or device assignment.
- Fault tolerance: Failed batches are retried automatically; you don’t need to write custom error handling or checkpointing.
- Elastic scaling: Easily scale up or down by changing concurrency or parallelism settings; Ray manages cluster resources and scheduling.
- Efficient batching and scheduling: Ray Data LLM uses continuous batching and intelligent scheduling to maximize GPU utilization, reducing idle time.
- Cross-node support: Ray can distribute model shards and replicas across multiple nodes, handling placement and communication.
- Simplified API: No need to manually manage vLLM engine lifecycles, process pools, or inter-process communication.
These features reduce operational complexity and make large-scale, distributed LLM inference production-ready. For contrast, a rough sketch of the manual map_batches pattern you would otherwise maintain yourself is shown below.
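A minimal sketch of that manual pattern, assuming vLLM's offline LLM API and a class-based map_batches UDF; the prompt column name and sampling settings are illustrative, and engine lifecycle, GPU sizing, batching, and any retry or checkpoint logic are all yours to manage:

```python
import ray
from vllm import LLM, SamplingParams


class ManualVLLMPredictor:
    """One vLLM engine per actor; you create, size, and place these yourself."""

    def __init__(self):
        # Single-GPU engine to keep the sketch simple; going to
        # tensor_parallel_size > 1 per actor is where the manual setup
        # (worker processes, placement, VLLM_WORKER_MULTIPROC_METHOD) gets hard.
        self.llm = LLM(model="unsloth/Llama-3.1-8B-Instruct")
        self.sampling = SamplingParams(temperature=0.2, max_tokens=128)

    def __call__(self, batch):
        outputs = self.llm.generate(list(batch["prompt"]), self.sampling)
        batch["answer"] = [o.outputs[0].text for o in outputs]
        return batch


ds = ray.data.from_items([{"prompt": "What is tensor parallelism?"}])
# You pick the actor count (data parallelism) and GPUs per actor by hand;
# retries, partial batches, and cross-node placement are yours to handle.
ds = ds.map_batches(
    ManualVLLMPredictor,
    concurrency=3,
    num_gpus=1,
    batch_size=32,
)
print(ds.take_all())
```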
@RunLLM
- **Automatic resource orchestration:** I thought this is already supported in the manual approach.
- **Fault tolerance:** Does this support checkpointing of data? Suppose I have a batch of 5 records and it fails after inference of 3 records. Is this information checkpointed? In the retry attempt, will it run inference only on the remaining 2 records?
- **Efficient batching and scheduling:** Is this not supported in the manual approach too?
- **Cross-node support:** Tell me more about this. How is this not done in the manual approach?
Automatic resource orchestration and efficient batching/scheduling can be implemented manually, but Ray Data LLM provides these out-of-the-box, including dynamic scaling, actor pool management, and continuous batching with vLLM, without requiring custom code for device assignment, process management, or scheduling (Ray Data LLM docs).
For fault tolerance, Ray Data LLM supports job-level checkpointing: if a batch fails, only unfinished records are retried. Completed records are checkpointed, so on restart, only the remaining records are processed—this is not available in standard manual map_batches, which typically requires restarting the entire job (Anyscale Docs: Ray Data job-level checkpointing, Ray GitHub issue #55008).
Cross-node support: Ray Data LLM and Ray Serve LLM natively support model and data parallelism across multiple nodes, automatically handling placement groups, resource bundling, and inter-node communication. Manual approaches require you to implement distributed process management, device mapping, and communication logic, which is complex and error-prone (Ray Serve LLM cross-node parallelism, Ray PR #56980).
Would you like a step-by-step breakdown of how checkpointing or cross-node placement works?
@RunLLM could you provide an example of cross-node support? How can it be done in the manual approach vs. how it is supported in Ray Data LLM?
In the manual approach, cross-node support for vLLM typically requires you to manually launch distributed vLLM engine processes on each node, manage inter-process communication, set up distributed environment variables (like MASTER_ADDR, RANK, etc.), and coordinate placement and resource allocation—this is complex and error-prone (Ray GitHub issue #53192).
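As a rough sketch of that manual workflow (the model name, prompts, and sampling settings are illustrative, and it assumes you have already joined the nodes into one Ray cluster yourself with ray start --head on the head node and ray start --address=<head-ip>:6379 on each worker):

```python
# Prerequisite you manage by hand:
#   head node:    ray start --head --port=6379
#   worker nodes: ray start --address=<head-ip>:6379
import ray
from vllm import LLM, SamplingParams

ray.init(address="auto")  # attach to the cluster you wired up yourself

# You spell out the sharding and tell vLLM to spread its workers over Ray;
# verifying that the 4 GPUs actually land across nodes is on you.
llm = LLM(
    model="unsloth/Llama-3.1-8B-Instruct",
    tensor_parallel_size=4,              # tensor parallelism only, for simplicity
    distributed_executor_backend="ray",
)

# Batching, retries, and writing results anywhere are also on you.
prompts = ["What is tensor parallelism?", "What is pipeline parallelism?"]
outputs = llm.generate(prompts, SamplingParams(temperature=0.2, max_tokens=128))
for out in outputs:
    print(out.outputs[0].text)
```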
With Ray Data LLM, cross-node parallelism is natively supported: you simply set tensor_parallel_size, pipeline_parallel_size, and distributed_executor_backend="ray" in engine_kwargs, and Ray automatically handles placement, resource bundling, and communication across nodes. Example:
```python
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

config = vLLMEngineProcessorConfig(
    model_source="unsloth/Llama-3.1-8B-Instruct",
    engine_kwargs={
        "tensor_parallel_size": 2,
        "pipeline_parallel_size": 2,
        "distributed_executor_backend": "ray",  # enables cross-node placement
        "max_model_len": 16384,
    },
    batch_size=32,
    concurrency=1,
)

processor = build_llm_processor(
    config,
    preprocess=lambda row: {...},
    postprocess=lambda row: {...},
)

# Apply the processor to your dataset as usual
```
Ray will automatically distribute the 4 GPUs (2 TP × 2 PP) across nodes as needed (Ray Data LLM docs, Ray example: basic_llm_example.py).
@RunLLM great discussion. One last question: which Ray version introduced the ray.data.llm integration? We are using 2.43 at the moment.
The ray.data.llm integration was introduced in Ray version 2.44.0. It is not available in 2.43; you will need to upgrade to at least 2.44 to use these APIs (Ray GitHub commit 2e2a40d4, Ray Discourse).
Would you like upgrade guidance or details on breaking changes?
Sources:
- Ray Discourse: ray.serve.llm is only supported in version 2.43.0, not in 2.42.1
- Ray GitHub commit 2e2a40d4