This might be a repeat question, if so please link.
I am curious what exactly happens when you call ray.get() on an object that is a Torch CUDA tensor living on another machine. How does that CUDA tensor get moved to your local GPU? I see a couple of options:
1. NVLink magic under the hood if applicable, or a direct GPU-to-GPU transfer via something like cudaMemcpyPeerAsync, updating only metadata in the object store.
2. Copy the entire tensor to the remote object store, fetch it into the local object store, and then copy it back onto the GPU, all using cudaMemcpy host-to-device or device-to-host.
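To make the difference between the two options concrete, here is a plain-Python sketch of the two candidate paths, with `bytearray`s standing in for GPU memory and a dict standing in for Ray's object store. All names here are illustrative stand-ins, not Ray APIs; the point is only to count copies.

```python
# Sketch of the two candidate transfer paths. Bytearrays stand in for GPU
# memory and a dict stands in for the plasma object store (host memory).
object_store = {}
copy_count = {"n": 0}  # count copies so the two paths can be compared


def memcpy(src: bytes) -> bytearray:
    """Stand-in for cudaMemcpy (host-to-device or device-to-host): one copy."""
    copy_count["n"] += 1
    return bytearray(src)


def peer_copy(src: bytes) -> bytearray:
    """Stand-in for cudaMemcpyPeerAsync: one direct device-to-device copy."""
    copy_count["n"] += 1
    return bytearray(src)


def transfer_via_object_store(gpu_tensor: bytes, key: str) -> bytearray:
    """Option 2: device -> host -> object store -> host -> device."""
    host_buf = memcpy(gpu_tensor)         # device-to-host copy
    object_store[key] = bytes(host_buf)   # put into the remote object store
    fetched = object_store[key]           # fetch into the local object store
    return memcpy(fetched)                # host-to-device copy


def transfer_peer_to_peer(gpu_tensor: bytes, key: str) -> bytearray:
    """Option 1: direct GPU-to-GPU copy; only metadata touches the store."""
    object_store[key] = {"device_ptr": id(gpu_tensor)}  # metadata only
    return peer_copy(gpu_tensor)


tensor = bytes(range(8))
a = transfer_via_object_store(tensor, "t1")
staged_copies = copy_count["n"]               # two explicit copies
b = transfer_peer_to_peer(tensor, "t2")
p2p_copies = copy_count["n"] - staged_copies  # one direct copy
```

The staged path pays two full-tensor copies (plus the object-store put/fetch), while the peer-to-peer path pays one.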
Hi @marsupialtail, this is a very good question. I would also be quite interested to hear about your application.
In my understanding, transferring a tensor from one GPU to another is currently done manually. This also means there is a lot of flexibility for the user to design their preferred GPU communication scheme.
If you are looking at distributed training, here are some examples:
@Jimmy’s right! Ray currently requires a copy to CPU memory and the object store before transferring tensors from a GPU. Of course, nothing in Ray prevents you from using a more efficient transport like NCCL or the CUDA runtime APIs, but it’s not yet built into Ray the same way CPU-CPU copies are. That said, if there’s a workload that would be enabled by this, we’re very interested in hearing about it.
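For anyone who wants to avoid the CPU staging today, below is a hedged sketch of a direct GPU-to-GPU send/recv between two Ray actors using `ray.util.collective`'s NCCL backend. It needs `ray`, `torch`, and two CUDA devices to actually run, so the setup is wrapped in a function and shown for illustration only; the group name `"p2p"` and the tensor shapes are arbitrary choices, not anything Ray requires.

```python
def build_send_recv_demo():
    """Wire two GPU actors into an NCCL group and do one direct transfer.

    Illustrative sketch only: requires ray, torch, and two CUDA devices.
    """
    import ray
    import torch
    import ray.util.collective as col

    @ray.remote(num_gpus=1)
    class Worker:
        def __init__(self, rank: int):
            self.rank = rank

        def setup(self, world_size: int):
            # Join a named NCCL group; each actor must use a unique rank.
            col.init_collective_group(
                world_size, self.rank, backend="nccl", group_name="p2p")

        def send(self):
            t = torch.arange(4, dtype=torch.float32, device="cuda")
            col.send(t, dst_rank=1, group_name="p2p")  # direct GPU-to-GPU

        def recv(self):
            t = torch.empty(4, dtype=torch.float32, device="cuda")
            col.recv(t, src_rank=0, group_name="p2p")
            return t.cpu()  # copy to host only to inspect the result

    sender, receiver = Worker.remote(0), Worker.remote(1)
    ray.get([sender.setup.remote(2), receiver.setup.remote(2)])
    result = receiver.recv.remote()
    sender.send.remote()
    return ray.get(result)
```

The tensor itself never passes through the object store here; only the small actor handles and the recv result do.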
@marsupialtail Do you have a workload that would be enabled by efficient GPU-GPU copies?