Ray normal DAG vs Compiled DAG

In the setup you describe, the output of the PreprocessingActor (on node 1, possibly using a GPU) is passed to the InferenceActor, which may be on the same node or a different one. By default, Ray serializes that output into CPU memory and stores it in the object store, even when it is a torch.Tensor that lives on the GPU. The InferenceActor then deserializes the data from CPU memory and, if needed, copies it back to a GPU. Actor-to-actor communication is therefore CPU-to-CPU by default, not direct GPU-to-GPU, even when both actors are on the same node and both use GPUs (Ray discuss: ray.get on torch cuda tensors).

Direct GPU-to-GPU communication between actors (zero-copy, NCCL, etc.) is only available through Ray Compiled Graph or Ray Direct Transport, and both require explicit configuration. Otherwise, all data passed between actors goes through CPU memory and the object store.

Would you like to see how to enable direct GPU-GPU communication between actors?
