Normal Ray DAG vs. Compiled Graph

In the current dynamic execution model (normal Ray), communication of GPU tensors is not optimized for direct GPU-to-GPU transfer. When a task returns a torch CUDA tensor, Ray serializes it to CPU memory and stores it in the object store; when another task or actor needs it, the tensor is deserialized on CPU and copied back to the GPU, even when both actors hold GPUs. Direct GPU-to-GPU communication (e.g., via NCCL) is not natively supported in this path; users who want NCCL or CUDA IPC must wire it up themselves. See the Ray Discuss thread: ray.get on torch cuda tensors.
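
Here is a minimal sketch of that path, assuming a cluster with at least two GPUs; the Producer/Consumer actor names are illustrative, not part of any Ray API:

```python
import ray
import torch

ray.init()

@ray.remote(num_gpus=1)
class Producer:
    def produce(self):
        # Tensor is created on this actor's GPU.
        return torch.ones(1024, device="cuda")

@ray.remote(num_gpus=1)
class Consumer:
    def consume(self, tensor):
        # In normal Ray the tensor travels through the object store
        # (serialized via CPU memory), so the consumer moves it back to
        # its own GPU explicitly. .to("cuda") is a no-op if it already
        # landed there.
        tensor = tensor.to("cuda")
        return tensor.sum().item()

producer = Producer.remote()
consumer = Consumer.remote()

# GPU -> CPU -> object store -> CPU -> GPU round trip.
print(ray.get(consumer.consume.remote(producer.produce.remote())))  # 1024.0
```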

In static compiled graphs, Ray analyzes the entire DAG at compile time, identifies which tensors will be communicated between GPU actors, and sets up NCCL communicators or other accelerator-specific channels in advance. This enables native, zero-copy, direct GPU-to-GPU communication for torch CUDA tensors, bypassing the CPU and object store. The communication schedule is precomputed, and Ray can overlap communication and computation for further efficiency. See Compiled Graph docs and quickstart.
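
For comparison, here is a minimal sketch of the same producer/consumer pattern as a Compiled Graph with NCCL transport, loosely following the Compiled Graph quickstart. The type-hint API shown here (TorchTensorType, with_type_hint, experimental_compile) reflects recent Ray releases and may differ across versions; two GPUs are again assumed:

```python
import ray
import ray.dag
import torch
from ray.experimental.channel.torch_tensor_type import TorchTensorType

ray.init()

@ray.remote(num_gpus=1)
class GPUSender:
    def send(self, size):
        return torch.ones(size, device="cuda")

@ray.remote(num_gpus=1)
class GPUReceiver:
    def recv(self, tensor):
        # With the NCCL transport hint, the tensor is delivered directly
        # to this actor's GPU, bypassing the CPU and object store.
        assert tensor.device.type == "cuda"
        return tensor.sum().item()

sender = GPUSender.remote()
receiver = GPUReceiver.remote()

with ray.dag.InputNode() as size:
    dag = sender.send.bind(size)
    # Declare at compile time that this edge carries a CUDA tensor and
    # should use NCCL for direct GPU-to-GPU transfer.
    dag = dag.with_type_hint(TorchTensorType(transport="nccl"))
    dag = receiver.recv.bind(dag)

compiled_dag = dag.experimental_compile()
print(ray.get(compiled_dag.execute(1024)))  # 1024.0
```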

Would you like a deeper technical breakdown?
