Ray.get() on Torch CUDA tensors

This might be a repeat question; if so, please link.

I am curious what exactly happens when you call ray.get() on an object that is a torch CUDA tensor on another machine. How does that CUDA tensor get moved to your local GPU? I see a couple of options:

  • NVLink magic under the hood if applicable, or a direct GPU-to-GPU transfer via something like cudaMemcpyPeerAsync, with only metadata updated in the object store.
  • Copy the entire tensor to the remote object store, fetch it into the local object store, and then copy it back onto the GPU, all via cudaMemcpy (device-to-host, then host-to-device).

What exactly happens here?
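
Concretely, the scenario I have in mind is something like this (just a minimal sketch; the shape and names are arbitrary):

```python
import ray
import torch

ray.init()

@ray.remote(num_gpus=1)
def produce():
    # The tensor lives in GPU memory on whichever node runs this task.
    return torch.randn(1024, 1024, device="cuda")

ref = produce.remote()

# What does this fetch actually do with the CUDA tensor?
tensor = ray.get(ref)
```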

Hey @marsupialtail, I am triaging the question. Will get back later. Please bear with me here.

Hi @marsupialtail, this is a very good question. I would also be quite interested to hear about your application.

In my understanding, GPU-to-GPU transfer currently has to be done manually. This also means there is a lot of flexibility for the user to design their preferred GPU communication.
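
For example, here is a minimal sketch of doing the transfer manually with torch.distributed over NCCL inside Ray actors. The rendezvous address, ranks, and tensor size are just placeholders, and it assumes a single node with at least two GPUs and a PyTorch version recent enough for NCCL point-to-point send/recv:

```python
import ray
import torch
import torch.distributed as dist

@ray.remote(num_gpus=1)
class Worker:
    def setup(self, rank, world_size, init_method):
        # Every actor joins the same NCCL process group, so tensors can move
        # GPU-to-GPU without being staged through Ray's object store.
        dist.init_process_group(backend="nccl", init_method=init_method,
                                rank=rank, world_size=world_size)

    def send(self):
        t = torch.arange(1024, dtype=torch.float32, device="cuda")
        dist.send(t, dst=1)              # NCCL point-to-point send from rank 0

    def recv(self):
        t = torch.empty(1024, device="cuda")
        dist.recv(t, src=0)              # lands directly in GPU memory on rank 1
        return float(t.sum())

ray.init()
sender, receiver = Worker.remote(), Worker.remote()
init = "tcp://127.0.0.1:29500"           # placeholder rendezvous address
ray.get([sender.setup.remote(0, 2, init),
         receiver.setup.remote(1, 2, init)])

# Launch both sides; the copy itself goes over NCCL (NVLink/PCIe as available).
out = receiver.recv.remote()
sender.send.remote()
print(ray.get(out))                      # sum of 0..1023
```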

If you are looking at distributed training, here are some examples:

For more advanced usage, you might find this repo interesting (https://github.com/ray-project/prototype_gpu_buffer), which implements collective communication behaviors for Ray remote actors.
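
There is also the experimental ray.util.collective library, which I believe exposes similar NCCL collectives between Ray actors; a rough sketch of what that looks like (group name, sizes, and actor class are just placeholders, so double-check the current docs):

```python
import ray
import torch
import ray.util.collective as col

@ray.remote(num_gpus=1)
class TrainerActor:
    def setup(self, world_size, rank):
        # Join a named NCCL collective group shared by all participating actors.
        col.init_collective_group(world_size, rank, backend="nccl",
                                  group_name="grad_group")

    def step(self):
        grads = torch.ones(16, device="cuda")
        # In-place allreduce across the actors' GPUs, no object-store round trip.
        col.allreduce(grads, group_name="grad_group")
        return grads[0].item()

ray.init()
actors = [TrainerActor.remote() for _ in range(2)]
ray.get([a.setup.remote(2, i) for i, a in enumerate(actors)])
print(ray.get([a.step.remote() for a in actors]))   # [2.0, 2.0]
```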

What do you mean Alpatrainer will be merged soon? Will that be part of Ray Train or something?

I follow Alpa very closely :smile:

@marsupialtail I think so; the unit tests are all passing now.

@cade would you like to add more context here? I realize my answer to this question is not exact.

@Jimmy’s right! Ray currently requires a copy to CPU memory and the object store before transferring tensors from a GPU. Of course, nothing in Ray prevents you from using a more efficient transport like NCCL or the CUDA runtime APIs, but it’s not yet built into Ray the same way CPU-CPU copies are. That said, if there’s a workload that would be enabled by this, we’re very interested in hearing about it.
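
In other words, the default path today amounts to something like the explicit version below, where the device-to-host and host-to-device copies are spelled out (a sketch; it assumes the caller also has a GPU to copy back onto):

```python
import ray
import torch

ray.init()

@ray.remote(num_gpus=1)
def produce():
    t = torch.randn(1024, 1024, device="cuda")
    # Device-to-host copy; Ray then serializes the CPU tensor into the
    # object store and ships it to the caller's node.
    return t.cpu()

# The fetched bytes live in host memory; moving them back onto a GPU is an
# explicit host-to-device copy.
local = ray.get(produce.remote()).to("cuda")
```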

@marsupialtail Do you have a workload that would be enabled by efficient GPU-GPU copies?

Well I think that’s in line with my profiling results.

I was trying to build something like Alpa before Alpa happened. I think the way to go is to just use Alpa.

I suppose Alpa might’ve been easier to build if y’all supported GPU to GPU copies in the first place. I don’t really know though.
