Ray normal DAG vs Compiled DAG

@RunLLM thank you for clarifying the steps. I have a similar setup in my batch job. Comparing this setup with the compiled graph, is the main difference whether GPU-GPU communication is handled natively by vLLM or orchestrated by the Ray-supported compiled graph? Which one is better?

Yes, the main difference is that in your current setup, vLLM (or another model-parallel framework) handles GPU-GPU communication internally within each actor, while Ray Compiled Graph enables Ray itself to orchestrate direct GPU-GPU (and potentially multi-actor, multi-node) communication using optimized channels like NCCL, with global scheduling and communication overlap. Compiled Graph can optimize resource allocation, communication, and scheduling across the entire pipeline, not just within a single actor.

Which is better depends on your use case:

  • If your workload is fully contained within a single actor (e.g., vLLM manages all intra-actor parallelism and communication), vLLM’s native approach is sufficient and simple.
  • If you need to coordinate GPU-GPU communication across multiple actors or nodes, or want to minimize system overhead and maximize throughput/latency for complex pipelines, Ray Compiled Graph provides more global optimization and flexibility (see docs).

Would you like a deeper technical comparison or example?

@RunLLM yes please, let's go ahead with the technical comparison, or better yet, an example

Certainly! Here’s a technical comparison with an example:

1. vLLM Native Model Parallelism (Current Setup):

  • vLLM internally shards the model across multiple GPUs within a single Ray actor.
  • All GPU-GPU communication (e.g., tensor parallelism, pipeline parallelism) is managed by vLLM using NCCL or similar libraries.
  • Ray’s role is to schedule the actor on a node with enough GPUs; Ray does not orchestrate or optimize GPU-GPU communication between actors.
  • Data flow: Ray → Actor (vLLM) → Ray. No direct actor-to-actor GPU tensor transfer.
  • Example:
    from ray.data.llm import vLLMEngineProcessorConfig

    config = vLLMEngineProcessorConfig(
        model_source="unsloth/Llama-3.1-8B-Instruct",
        engine_kwargs={
            "tensor_parallel_size": 2,
            "pipeline_parallel_size": 2,  # 4 GPUs per actor
        },
        batch_size=32,
        concurrency=1,
    )
    # Each actor loads the model sharded across 4 GPUs; vLLM handles all GPU comms.
    
  • Best for: Single-node, single-actor, or when vLLM’s parallelism is sufficient.

2. Ray Compiled Graph:

  • The entire computation graph (across actors, possibly across nodes) is statically defined and compiled.
  • Ray sets up NCCL communicators and communication channels between actors in advance, enabling direct, zero-copy GPU-GPU communication between actors (not just within one actor).
  • Ray can overlap compute and communication, pre-allocate resources, and globally optimize scheduling and data movement.
  • Data flow: Ray → Actor1 (GPU) → Actor2 (GPU) … with direct GPU tensor transfer between actors.
  • Example (from docs):
    import ray
    import torch
    from ray.dag import InputNode
    
    @ray.remote(num_gpus=1)
    class Worker:
        def process(self, x):
            # x can be a torch.Tensor on GPU
            return x * 2
    
    w1 = Worker.remote()
    w2 = Worker.remote()
    
    with InputNode() as inp:
        dag = w1.process.bind(inp)
        # transfer w1's output tensor to w2 directly over NCCL (GPU-GPU)
        dag = dag.with_tensor_transport(transport="nccl")
        dag = w2.process.bind(dag)
    
    compiled = dag.experimental_compile()
    # note: the CUDA input here assumes the driver process can see a GPU
    result = ray.get(compiled.execute(torch.ones(10, device="cuda")))
    
  • Best for: Multi-actor, multi-node, or complex pipelines where you want Ray to optimize and orchestrate GPU-GPU communication globally.

Summary Table:

Feature                         vLLM Native Parallelism    Ray Compiled Graph
GPU comm. within actor          Yes (vLLM/NCCL)            Yes
GPU comm. between actors        No (CPU fallback)          Yes (NCCL, zero-copy)
Multi-node comm. optimization   No                         Yes
Scheduling optimization         Per-actor                  Global, DAG-level
Overlap compute/comm            Limited (by vLLM)          Yes (Ray-managed)
Use case                        Single actor/node          Multi-actor/node pipelines

Would you like a more detailed code example or a benchmark comparison?

@RunLLM awesome. So then I would like to understand this:

  1. What do you mean by “workload fully contained within a single actor”? I am also curious about an example of inference where GPU-GPU communication between two or more actors is required: when does multi-actor inference typically happen? Also, let's discuss what happens when a node with the desired number of GPUs is not available. How does vLLM do tensor or model parallelism then? Is it done across multiple nodes? In that case, actor-actor communication between two nodes is inevitable in my understanding. Is Ray Compiled Graph better in this scenario?

A “workload fully contained within a single actor” means all computation, model sharding, and GPU-GPU communication happen inside one Ray actor process. For example, vLLM or DeepSpeed can shard a model across multiple GPUs on a node, but all GPUs are managed by the same actor, and no tensors are passed between different Ray actors—so Ray itself never needs to coordinate GPU-GPU communication between actors (see vLLM parallelism).
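
For illustration, here is a minimal sketch of what "fully contained within a single actor" can look like in code. It assumes vLLM is installed and uses its default executor inside the actor; the model name is just the one used earlier in this thread.

import ray
from vllm import LLM, SamplingParams

@ray.remote(num_gpus=4)
class SingleActorLLM:
    def __init__(self):
        # vLLM shards the model across this actor's 4 GPUs and manages all
        # GPU-GPU (NCCL) traffic internally; Ray only placed the actor.
        self.llm = LLM(
            model="unsloth/Llama-3.1-8B-Instruct",
            tensor_parallel_size=4,
        )

    def generate(self, prompts):
        params = SamplingParams(temperature=0.3, max_tokens=100)
        return [o.outputs[0].text for o in self.llm.generate(prompts, params)]

actor = SingleActorLLM.remote()
print(ray.get(actor.generate.remote(["What is Ray?"])))

No tensors ever cross an actor boundary here, which is why the Compiled Graph machinery adds little in this case.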

Multi-actor inference (where GPU-GPU communication between actors is needed) typically happens in these scenarios:

  • Model is too large for a single node: You need to shard the model across GPUs on multiple nodes, so each actor manages a subset of GPUs and must exchange tensors/results with other actors.
  • Pipeline or ensemble models: Each actor runs a different stage or sub-model, and outputs (often as GPU tensors) must be passed directly from one actor to another for efficiency.
  • Distributed data-parallel or expert-parallel (MoE) inference: Each actor processes a partition of the data/model and must exchange intermediate results or activations with others.

Example:
Suppose you have a two-stage pipeline: Stage 1 actor runs a feature extractor on GPU, and Stage 2 actor runs a classifier on GPU. For maximum efficiency, you want to pass the intermediate tensor directly from Stage 1’s GPU to Stage 2’s GPU, without copying through CPU/object store. This is where Ray Compiled Graph’s direct GPU-GPU actor communication is valuable (see compiled graph docs).
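
To make that concrete, here is a minimal sketch of such a two-stage pipeline with Ray Compiled Graph. The "feature extractor" and "classifier" are stand-in matrix multiplications rather than real models, and two GPUs are assumed to be available.

import ray
import torch
from ray.dag import InputNode

@ray.remote(num_gpus=1)
class FeatureExtractor:
    def __init__(self):
        self.weight = torch.randn(128, 64, device="cuda")

    def extract(self, x):
        # produce a CUDA tensor so the NCCL edge below can move it GPU-to-GPU
        return x.to("cuda") @ self.weight

@ray.remote(num_gpus=1)
class Classifier:
    def __init__(self):
        self.weight = torch.randn(64, 10, device="cuda")

    def classify(self, feats):
        # feats arrives directly on this actor's GPU via NCCL
        return (feats @ self.weight).argmax(dim=-1).cpu()

extractor = FeatureExtractor.remote()
classifier = Classifier.remote()

with InputNode() as inp:
    feats = extractor.extract.bind(inp)
    feats = feats.with_tensor_transport(transport="nccl")  # Stage 1 GPU -> Stage 2 GPU
    dag = classifier.classify.bind(feats)

compiled = dag.experimental_compile()
labels = ray.get(compiled.execute(torch.randn(32, 128)))  # driver input via object store
print(labels.shape)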

Would you like a code example of such a multi-actor GPU pipeline?

@RunLLM thinking about multi-actor inference, what happens when a node with the desired number of GPUs is not available for a particular actor? How does vLLM do tensor or model parallelism then? Is it done across multiple nodes? In that case, actor-actor communication between two nodes is inevitable in my understanding.

Correct—if a single node does not have enough GPUs for the desired tensor/pipeline parallelism, vLLM (and similar frameworks) can perform model parallelism across multiple nodes. In this scenario, the model is sharded across GPUs on different nodes, and inter-node communication is required for tensor/pipeline parallelism. This means actor-to-actor (or process-to-process) communication between nodes is inevitable, and vLLM uses Ray’s distributed backend (with NCCL or similar) to coordinate this (see vLLM multi-node TP/PP support).

Recent Ray and vLLM releases support multi-node tensor/pipeline parallelism by relaxing placement constraints, allowing actors to be scheduled on different nodes and enabling cross-node GPU communication. However, this introduces additional communication overhead, and optimal performance is typically achieved when all GPUs are on the same node (see benchmark results).
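
For reference, outside of Ray Data the same multi-node sharding can be requested directly on the vLLM engine. This is a minimal sketch assuming a Ray cluster is already running across the nodes; the model choice and parallelism sizes are illustrative.

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # illustrative; pick a model that needs >1 node
    tensor_parallel_size=8,                     # 8-way tensor parallelism
    pipeline_parallel_size=2,                   # 2 pipeline stages -> 16 GPUs total
    distributed_executor_backend="ray",         # let Ray place and connect the TP/PP workers
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)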

Would you like more detail on how to configure or optimize multi-node model parallelism?

@RunLLM so we provide num_gpus as a parameter in map_batches, and Ray schedules a node for this actor/function. Now suppose the node Ray chooses does not have enough GPUs. That would mean Ray must have chosen additional nodes for this actor. So how is this information passed to the vLLM instance being instantiated inside that actor? How does vLLM know which nodes are selected for this actor?

When you specify num_gpus in map_batches, Ray tries to schedule each actor on a node with at least that many GPUs. If a single node does not have enough GPUs, Ray will not split the actor across multiple nodes; instead, the actor will remain pending until a node with sufficient GPUs is available. Ray actors are always placed on a single node, and num_gpus refers to GPUs on that node only (Ray actor resource docs).

For true multi-node model parallelism (e.g., vLLM tensor/pipeline parallelism across nodes), you must use vLLM’s distributed backend with a placement group configuration that allows cross-node placement. In this case, Ray Serve or Ray Data (with vLLM) creates a placement group with resource bundles that may span multiple nodes. Ray passes the placement group and bundle information to the vLLM engine, which then uses Ray’s APIs to discover the node and GPU assignments for each worker. vLLM uses this information to initialize distributed communication (e.g., with NCCL) across the selected nodes (vLLM multi-node TP/PP support, Ray Serve LLM placement group config).
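
For intuition, this is roughly the Ray primitive underneath (a minimal sketch; the Ray Data / Ray Serve LLM integrations do this wiring for you, and the worker class here is just a stand-in):

import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init(address="auto")  # connect to the running multi-node cluster

# 8 one-GPU bundles; PACK prefers as few nodes as possible but may span nodes
pg = placement_group([{"GPU": 1, "CPU": 1}] * 8, strategy="PACK")
ray.get(pg.ready())

@ray.remote(num_gpus=1)
class EngineWorker:
    def where_am_i(self):
        import socket
        return socket.gethostname(), ray.get_gpu_ids()

workers = [
    EngineWorker.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=pg, placement_group_bundle_index=i
        )
    ).remote()
    for i in range(8)
]

# Each worker reports its node and GPU IDs; this is the kind of information the
# vLLM Ray backend gathers to set up NCCL process groups across nodes.
print(ray.get([w.where_am_i.remote() for w in workers]))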

Would you like a code example of how to configure multi-node placement for vLLM?

@RunLLM yes, let's go through a code example of the multi-node setup

Here’s a code example of configuring Ray Data with vLLM for multi-node tensor/pipeline parallelism, so the model can be sharded across GPUs on multiple nodes:

import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

# Placement group config: 8 bundles (1 GPU + 1 CPU each); PACK prefers one node
# but allows bundles to span nodes (use STRICT_PACK to force a single node)
placement_group_config = {
    "bundles": [{"GPU": 1, "CPU": 1}] * 8,  # 8 total GPUs (could be 4 GPUs on 2 nodes, etc.)
    "strategy": "PACK",
}

config = vLLMEngineProcessorConfig(
    model_source="facebook/opt-1.3b",
    engine_kwargs=dict(
        tensor_parallel_size=4,      # 4-way tensor parallelism
        pipeline_parallel_size=2,    # 2-way pipeline parallelism (total 8 GPUs)
        distributed_executor_backend="ray",  # Enable Ray distributed backend for multi-node
        max_model_len=4096,
        max_num_batched_tokens=2048,
    ),
    placement_group_config=placement_group_config,
    concurrency=1,
    batch_size=32,
)

processor = build_llm_processor(
    config,
    preprocess=lambda row: dict(
        prompt=f"Prompt: {row['item']}",
        sampling_params=dict(
            temperature=0.3,
            max_tokens=100,
        ),
    ),
    postprocess=lambda row: dict(
        resp=row["generated_text"],
    ),
)

ds = ray.data.from_items([{"item": f"Example {i}"} for i in range(100)])
ds = processor(ds)
ds = ds.materialize()
print(ds.take_all())

Key points:

  • The placement_group_config with multiple bundles and the “PACK” strategy allows Ray to allocate GPUs across multiple nodes.
  • tensor_parallel_size and pipeline_parallel_size together determine the total number of GPUs (here, 4 × 2 = 8).
  • vLLM, when using the "ray" distributed backend, will automatically discover the node and GPU assignments for each worker and set up NCCL communication across nodes.
  • This setup is tested in Ray’s multi-node vLLM tests and placement group PR.

Would you like to see a Ray Serve (online inference) multi-node example as well?

@RunLLM so to summarize: when it comes to multi-GPU LLM inference on a single-node setup, let vLLM handle it (not sure if other engines like SGLang support this). Ray Compiled Graph is not much help here; it truly shines when a multi-node actor setup is in place.

Suppose an inference job has 3 phases: 1) preprocessing, a torch.nn module or .pt2 that does data preprocessing; 2) inference, a .pt2 that takes input from preprocessing; 3) postprocessing, which takes input from inference. All 3 phases are added through 3 map_batches() calls. In this case the nodes could be different, so it would be a multi-node setup, right?

Correct. For multi-GPU inference on a single node, vLLM (and similar engines) natively handle model sharding and GPU-GPU communication within the actor, so Ray Compiled Graph does not add much benefit. Engines like SGLang may not support advanced model parallelism like vLLM (see vLLM parallelism).

In your 3-phase pipeline (preprocessing → inference → postprocessing, each as a separate map_batches()), Ray may schedule each phase’s actors on different nodes, especially in a large cluster. This results in a multi-node setup, and if you want to pass GPU tensors efficiently between these phases (actors), Ray Compiled Graph can optimize direct GPU-GPU communication and global scheduling across nodes (see compiled graph docs).

Would you like a code example of such a multi-phase, multi-node pipeline?

@RunLLM yes please, could you give me an example setup of the above 3-phase pipeline? Suppose we have an actor class that takes a .pt2 file during its __init__. Upon receiving data from the Ray Dataset, its __call__ method is invoked by Ray, which takes in a batch of a Ray Data block and runs inference using that .pt2.

Certainly! Here's an example of a 3-phase Ray Data pipeline (preprocessing → inference → postprocessing), where each phase uses an actor class that loads a .pt2 model file in its __init__ and processes batches in __call__ (the example loads with torch.jit.load; if your .pt2 is a torch.export artifact, use torch.export.load(pt2_path).module() instead). Each phase is connected via map_batches, and Ray can schedule actors on different nodes for each phase.

import ray
import torch
import numpy as np

# Example actor for preprocessing
class PreprocessActor:
    def __init__(self, pt2_path):
        # torch.jit.load assumes a TorchScript artifact; for a torch.export-style
        # .pt2 file, swap in torch.export.load(pt2_path).module()
        self.model = torch.jit.load(pt2_path)
        self.model.eval()

    def __call__(self, batch):
        # Assume batch is a dict with "data" key
        tensor = torch.as_tensor(batch["data"], dtype=torch.float32)
        with torch.inference_mode():
            processed = self.model(tensor)
        return {"preprocessed": processed.numpy()}

# Example actor for inference
class InferenceActor:
    def __init__(self, pt2_path):
        self.model = torch.jit.load(pt2_path)
        self.model.eval()

    def __call__(self, batch):
        tensor = torch.as_tensor(batch["preprocessed"], dtype=torch.float32)
        with torch.inference_mode():
            preds = self.model(tensor)
        return {"inference": preds.numpy()}

# Example actor for postprocessing
class PostprocessActor:
    def __init__(self, pt2_path):
        self.model = torch.jit.load(pt2_path)
        self.model.eval()

    def __call__(self, batch):
        tensor = torch.as_tensor(batch["inference"], dtype=torch.float32)
        with torch.inference_mode():
            post = self.model(tensor)
        return {"result": post.numpy()}

# Create Ray Dataset
ds = ray.data.from_items([{"data": np.random.rand(10)} for _ in range(1000)])

# 3-phase pipeline
ds = ds.map_batches(
    PreprocessActor,
    fn_constructor_kwargs={"pt2_path": "preprocess.pt2"},
    batch_size=32,
    num_gpus=1,
    concurrency=1,  # required when passing a callable class; sizes the actor pool
)
ds = ds.map_batches(
    InferenceActor,
    fn_constructor_kwargs={"pt2_path": "inference.pt2"},
    batch_size=32,
    num_gpus=1,
    concurrency=1,
)
ds = ds.map_batches(
    PostprocessActor,
    fn_constructor_kwargs={"pt2_path": "postprocess.pt2"},
    batch_size=32,
    num_gpus=1,
    concurrency=1,
)

# Collect results
print(ds.take(5))

  • Each map_batches launches a pool of actors (sized by the concurrency argument) on nodes with available GPUs; actors are reused across blocks.
  • Each actor loads its .pt2 model in __init__ and processes batches in __call__.
  • Ray schedules each phase independently, so preprocessing, inference, and postprocessing can run on different nodes if needed.

Would you like to see how to further optimize this with Ray Compiled Graph for direct GPU-GPU communication?

@RunLLM so in this case there would be two map_batches() calls on the Ray dataset, right? So this is actor-actor data transfer: the PreprocessActor does its processing and passes the data to the InferenceActor. Preprocessing is done by an actor on a multi-GPU setup on node 1. On node 1, GPU-GPU communication is not happening if the model can be loaded on one GPU. So within the node there is no GPU-GPU communication, but the output of the PreprocessActor is fed to the InferenceActor, and that's where actor-actor communication happens. But is this actor-actor communication GPU-GPU or CPU-CPU?

In your described setup, the output of the PreprocessingActor (on node 1, possibly using GPU) is passed to the InferenceActor (which may be on the same or a different node). By default, Ray serializes the output (even if it’s a torch.Tensor on GPU) to CPU memory and stores it in the object store. When the InferenceActor receives the data, it is deserialized from CPU memory and, if needed, copied to GPU. So, the actor-actor communication is CPU-CPU by default, not direct GPU-GPU—even if both actors are on the same node and both use GPUs (Ray discuss: ray.get on torch cuda tensors).

Direct GPU-GPU actor-actor communication (zero-copy, NCCL, etc.) is only available with Ray Compiled Graph or Ray Direct Transport features, and requires explicit configuration. Otherwise, all inter-actor data transfer goes through CPU/object store.

Would you like to see how to enable direct GPU-GPU communication between actors?

@RunLLM no, but then even if this is a multi-node setup, Ray Compiled Graph is of no help here, right? Because there is no GPU-GPU communication.

Correct—if your pipeline only passes CPU data (e.g., numpy arrays or CPU tensors) between actors, even in a multi-node setup, Ray Compiled Graph’s GPU-GPU communication optimizations are not utilized. The main benefit of Ray Compiled Graph in this context would be reduced system overhead and improved scheduling for repeated, static pipelines, but not direct GPU-GPU data transfer. Direct GPU-GPU communication is only leveraged if you explicitly pass GPU tensors (e.g., torch.Tensor on CUDA) between actors and configure the compiled graph for accelerator transport (Compiled Graph docs).

Would you like to see how to structure your pipeline to take advantage of GPU-GPU communication?
