Ray.get() on Torch CUDA tensors

This might be a repeat question; if so, please link.

I am curious what exactly happens when you call ray.get() on an object that is a torch CUDA tensor on another machine. How does that CUDA tensor get moved to your local GPU? I see a couple of options:

  • NVLink magic under the hood if applicable, or a direct GPU-to-GPU transfer via something like cudaMemcpyPeerAsync, with only metadata updated in the object store.
  • Copy the entire tensor to the remote object store, fetch it into the local object store, and then copy it back onto the GPU, all via cudaMemcpy (device-to-host, then host-to-device).

What exactly happens here?
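
Concretely, the scenario I have in mind is something like this (just a minimal sketch; the shape and names are arbitrary):

```python
import ray
import torch

ray.init()

@ray.remote(num_gpus=1)
def produce():
    # The tensor lives in GPU memory on whichever node runs this task.
    return torch.randn(1024, 1024, device="cuda")

ref = produce.remote()

# What does this fetch actually do with the CUDA tensor?
tensor = ray.get(ref)
```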

Hey @marsupialtail, I am triaging the question. Will get back later. Please bear with me here.

Hi @marsupialtail, this is a very good question. I would also be quite interested to hear about your application.

In my understanding, GPU-to-GPU transfer currently has to be done manually. This also means there is a lot of flexibility for the user to design their preferred GPU communication.
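
For example, here is a minimal sketch of doing the transfer manually with torch.distributed over NCCL inside Ray actors. The rendezvous address, ranks, and tensor size are just placeholders, and it assumes a single node with at least two GPUs and a PyTorch version recent enough for NCCL point-to-point send/recv:

```python
import ray
import torch
import torch.distributed as dist

@ray.remote(num_gpus=1)
class Worker:
    def setup(self, rank, world_size, init_method):
        # Every actor joins the same NCCL process group, so tensors can move
        # GPU-to-GPU without being staged through Ray's object store.
        dist.init_process_group(backend="nccl", init_method=init_method,
                                rank=rank, world_size=world_size)

    def send(self):
        t = torch.arange(1024, dtype=torch.float32, device="cuda")
        dist.send(t, dst=1)              # NCCL point-to-point send from rank 0

    def recv(self):
        t = torch.empty(1024, device="cuda")
        dist.recv(t, src=0)              # lands directly in GPU memory on rank 1
        return float(t.sum())

ray.init()
sender, receiver = Worker.remote(), Worker.remote()
init = "tcp://127.0.0.1:29500"           # placeholder rendezvous address
ray.get([sender.setup.remote(0, 2, init),
         receiver.setup.remote(1, 2, init)])

# Launch both sides; the copy itself goes over NCCL (NVLink/PCIe as available).
out = receiver.recv.remote()
sender.send.remote()
print(ray.get(out))                      # sum of 0..1023
```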

If you are looking at distributed training, here are some examples:

For more advanced usage, you might find this repo interesting (https://github.com/ray-project/prototype_gpu_buffer), which implements collective communication behaviors for Ray remote actors.
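
There is also the experimental ray.util.collective library, which I believe exposes similar NCCL collectives between Ray actors; a rough sketch of what that looks like (group name, sizes, and actor class are just placeholders, so double-check the current docs):

```python
import ray
import torch
import ray.util.collective as col

@ray.remote(num_gpus=1)
class TrainerActor:
    def setup(self, world_size, rank):
        # Join a named NCCL collective group shared by all participating actors.
        col.init_collective_group(world_size, rank, backend="nccl",
                                  group_name="grad_group")

    def step(self):
        grads = torch.ones(16, device="cuda")
        # In-place allreduce across the actors' GPUs, no object-store round trip.
        col.allreduce(grads, group_name="grad_group")
        return grads[0].item()

ray.init()
actors = [TrainerActor.remote() for _ in range(2)]
ray.get([a.setup.remote(2, i) for i, a in enumerate(actors)])
print(ray.get([a.step.remote() for a in actors]))   # [2.0, 2.0]
```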

What do you mean Alpatrainer will be merged soon? Will that be part of Ray Train or something?

I follow Alpa very closely :smile:

@marsupialtail I think so; the unit tests are all passing now.

@cade would you like to add more context here? I realize my answer to this question is not exact.

@Jimmy’s right! Ray currently requires a copy to CPU memory and the object store before transferring tensors from a GPU. Of course, nothing in Ray prevents you from using a more efficient transport like NCCL or the CUDA runtime APIs, but it’s not yet built into Ray the same way CPU-CPU copies are. That said, if there’s a workload that would be enabled by this, we’re very interested in hearing about it.
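
In other words, the default path today amounts to something like the explicit version below, where the device-to-host and host-to-device copies are spelled out (a sketch; it assumes the caller also has a GPU to copy back onto):

```python
import ray
import torch

ray.init()

@ray.remote(num_gpus=1)
def produce():
    t = torch.randn(1024, 1024, device="cuda")
    # Device-to-host copy; Ray then serializes the CPU tensor into the
    # object store and ships it to the caller's node.
    return t.cpu()

# The fetched bytes live in host memory; moving them back onto a GPU is an
# explicit host-to-device copy.
local = ray.get(produce.remote()).to("cuda")
```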

@marsupialtail Do you have a workload that would be enabled by efficient GPU-GPU copies?

Well I think that’s in line with my profiling results.

I was trying to build something like Alpa before Alpa happened. I think the way to go is to just use Alpa.

I suppose Alpa might’ve been easier to build if y’all supported GPU to GPU copies in the first place. I don’t really know though.
