Ray normal DAG vs Compiled DAG

@RunLLM yes please can you give me an example of how to structure above pipeline to take advantage of GPU-GPU communication?

To enable direct GPU-GPU communication between the actors in your pipeline, you need to (1) pass torch CUDA tensors between actors (not CPU tensors or numpy arrays), and (2) use Ray Compiled Graph with a tensor transport hint on the actor-to-actor edges. Here’s a simplified example of a 3-phase pipeline (preprocessing → inference → postprocessing), where each phase is a Ray actor operating on GPU tensors and the compiled graph uses NCCL for the transfers between actors:

import ray
import torch
from ray.dag import InputNode

@ray.remote(num_gpus=1)
class PreprocessActor:
    def __init__(self, pt2_path):
        # Load the exported model and move it to this actor's GPU. Adjust the
        # loading call to match your artifact format (torch.jit.load expects
        # TorchScript; .pt2 exports need the torch.export / AOTInductor loaders).
        self.model = torch.jit.load(pt2_path).cuda()
        self.model.eval()
    def process(self, batch):
        with torch.inference_mode():
            # Assume batch is a CUDA tensor
            return self.model(batch)

@ray.remote(num_gpus=1)
class InferenceActor:
    def __init__(self, pt2_path):
        self.model = torch.jit.load(pt2_path).cuda()
        self.model.eval()
    def infer(self, batch):
        with torch.inference_mode():
            return self.model(batch)

@ray.remote(num_gpus=1)
class PostprocessActor:
    def __init__(self, pt2_path):
        self.model = torch.jit.load(pt2_path).cuda()
        self.model.eval()
    def post(self, batch):
        with torch.inference_mode():
            return self.model(batch)

# Instantiate actors
pre = PreprocessActor.remote("preprocess.pt2")
inf = InferenceActor.remote("inference.pt2")
post = PostprocessActor.remote("postprocess.pt2")

# Build compiled graph with GPU-GPU transport between the actors
with InputNode() as inp:
    # The driver -> preprocess hop still goes through the object store;
    # NCCL is used for the actor-to-actor edges annotated below.
    dag = pre.process.bind(inp)
    dag = dag.with_tensor_transport("nccl")  # preprocess -> inference over NCCL
    dag = inf.infer.bind(dag)
    dag = dag.with_tensor_transport("nccl")  # inference -> postprocess over NCCL
    dag = post.post.bind(dag)
    # No transport hint on the last node: its result is returned to the driver,
    # which is not part of the NCCL group.

compiled = dag.experimental_compile()

# Run the pipeline with a CUDA tensor input
input_tensor = torch.randn(32, 10, device="cuda")
result = ray.get(compiled.execute(input_tensor))
print(result)
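
The compiled graph can be reused for every batch, and it can be torn down explicitly when the pipeline is finished, which also releases the NCCL channels it created. A minimal sketch (the batch shape here is just a placeholder):

# Reuse the same compiled graph for subsequent batches
for _ in range(10):
    out = ray.get(compiled.execute(torch.randn(32, 10, device="cuda")))

# Release the compiled graph and the NCCL channels it set up
compiled.teardown()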

Key points:

  • Each actor method receives and returns a CUDA tensor.
  • .with_tensor_transport("nccl") on an intermediate node tells Ray to send that node's output directly from one actor's GPU to the next over NCCL.
  • The compiled graph is statically defined, so the NCCL communicators and channels are set up once at compile time and reused on every execute() call.

This structure keeps intermediate tensors in GPU memory between pipeline stages (only the driver-facing input and output go through the object store), which is what makes the compiled graph efficient in multi-GPU and multi-node setups.
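
For contrast with a normal (non-compiled) Ray DAG, the topic of this thread: the same pipeline can be bound and executed without experimental_compile(), in which case each intermediate result is serialized through the Ray object store instead of being sent over NCCL, so tensors do not stay resident on the GPU between stages. A rough sketch, assuming a second set of actor handles (pre2, inf2, post2, which need their own GPUs) so it does not interfere with the compiled graph above; depending on your Ray version the stages may receive CPU tensors and need to move them back onto the GPU themselves:

# Normal (non-compiled) DAG for comparison: intermediate results travel
# through the object store rather than directly between GPUs.
pre2 = PreprocessActor.remote("preprocess.pt2")
inf2 = InferenceActor.remote("inference.pt2")
post2 = PostprocessActor.remote("postprocess.pt2")

with InputNode() as inp:
    normal_dag = post2.post.bind(inf2.infer.bind(pre2.process.bind(inp)))

# execute() on a non-compiled DAG returns an ObjectRef; no NCCL channels are created.
ref = normal_dag.execute(input_tensor)
print(ray.get(ref))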
