Normal Ray DAG vs. Compiled Graph

In the current dynamic execution model (normal Ray), communication of GPU tensors is not optimized for direct GPU-to-GPU transfer. When a task returns a torch CUDA tensor, Ray serializes it to CPU memory and stores it in the object store; when another task or actor needs it, the tensor is deserialized on CPU and copied back to the GPU, even when both actors hold GPUs. Direct GPU-to-GPU communication (e.g., via NCCL) is not natively supported in this path; users who want NCCL or CUDA IPC must wire it up themselves. See the Ray Discuss thread: ray.get on torch cuda tensors.
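
Here is a minimal sketch of that path, assuming a cluster with at least two GPUs; the Producer/Consumer actor names are illustrative, not part of any Ray API:

```python
import ray
import torch

ray.init()

@ray.remote(num_gpus=1)
class Producer:
    def produce(self):
        # Tensor is created on this actor's GPU.
        return torch.ones(1024, device="cuda")

@ray.remote(num_gpus=1)
class Consumer:
    def consume(self, tensor):
        # In normal Ray the tensor travels through the object store
        # (serialized via CPU memory), so the consumer moves it back to
        # its own GPU explicitly. .to("cuda") is a no-op if it already
        # landed there.
        tensor = tensor.to("cuda")
        return tensor.sum().item()

producer = Producer.remote()
consumer = Consumer.remote()

# GPU -> CPU -> object store -> CPU -> GPU round trip.
print(ray.get(consumer.consume.remote(producer.produce.remote())))  # 1024.0
```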

In static compiled graphs, Ray analyzes the entire DAG at compile time, identifies which tensors will be communicated between GPU actors, and sets up NCCL communicators or other accelerator-specific channels in advance. This enables native, zero-copy, direct GPU-to-GPU communication for torch CUDA tensors, bypassing the CPU and object store. The communication schedule is precomputed, and Ray can overlap communication and computation for further efficiency. See Compiled Graph docs and quickstart.
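
For comparison, here is a minimal sketch of the same producer/consumer pattern as a Compiled Graph with NCCL transport, loosely following the Compiled Graph quickstart. The type-hint API shown here (TorchTensorType, with_type_hint, experimental_compile) reflects recent Ray releases and may differ across versions; two GPUs are again assumed:

```python
import ray
import ray.dag
import torch
from ray.experimental.channel.torch_tensor_type import TorchTensorType

ray.init()

@ray.remote(num_gpus=1)
class GPUSender:
    def send(self, size):
        return torch.ones(size, device="cuda")

@ray.remote(num_gpus=1)
class GPUReceiver:
    def recv(self, tensor):
        # With the NCCL transport hint, the tensor is delivered directly
        # to this actor's GPU, bypassing the CPU and object store.
        assert tensor.device.type == "cuda"
        return tensor.sum().item()

sender = GPUSender.remote()
receiver = GPUReceiver.remote()

with ray.dag.InputNode() as size:
    dag = sender.send.bind(size)
    # Declare at compile time that this edge carries a CUDA tensor and
    # should use NCCL for direct GPU-to-GPU transfer.
    dag = dag.with_type_hint(TorchTensorType(transport="nccl"))
    dag = receiver.recv.bind(dag)

compiled_dag = dag.experimental_compile()
print(ray.get(compiled_dag.execute(1024)))  # 1024.0
```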

Would you like a deeper technical breakdown?
