Can you copy data between two nodes' CPU and GPU?

Hi, I have a use case where one node in the Ray cluster only has GPU resources left and no CPU available (the CPUs are occupied by other pods outside the Ray cluster).
The question is: in the case of distributed RaySGD training, can Ray leverage this isolated GPU?

How is the data transfer between CPU and GPU handled by Ray? Can Ray copy the data from a remote CPU to this lonely GPU?

@amogkam do you know about this?

Thanks for the question @valiantljk, this is an interesting use case.

For the first question, if no CPUs are physically available then I don’t think you’d be able to execute any Python code. If it’s just that you don’t want Ray to reserve a CPU for the training worker, you can achieve this by setting num_cpus_per_worker to 0 in your TorchTrainer.
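For what it’s worth, here is a minimal configuration sketch of that second option. MyTrainingOperator is a placeholder for your own TrainingOperator subclass, and the worker count is arbitrary; the argument names follow the RaySGD TorchTrainer API:

```python
import ray
from ray.util.sgd.torch import TorchTrainer

ray.init(address="auto")  # connect to the existing Ray cluster

# Sketch only: MyTrainingOperator is a placeholder for your own
# TrainingOperator subclass. num_cpus_per_worker=0 tells Ray not to
# reserve any CPU resource for each training worker, so the worker
# can be scheduled on the node that only has GPUs free.
trainer = TorchTrainer(
    training_operator_cls=MyTrainingOperator,
    num_workers=2,
    use_gpu=True,
    num_cpus_per_worker=0,
)
```

Note that even with a reservation of 0, the worker process still executes on a physical CPU; the setting only affects Ray's scheduling bookkeeping.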

For the second question, you could use Ray to copy the data from a remote CPU to a local CPU, and then use some other library like torch to copy it to GPU. But Ray itself does not use GPU memory for the object store.
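As a sketch of that flow (this assumes a CUDA GPU is visible to the task and that the data is a NumPy array; the function name is made up for illustration): Ray ships the object into the GPU node's host memory, and torch does the final host-to-device copy.

```python
import numpy as np
import ray
import torch

@ray.remote(num_gpus=1, num_cpus=0)
def to_gpu(batch):
    # Ray has already copied `batch` from the remote object store into
    # this node's CPU memory (the object store lives in host RAM, never
    # in GPU memory) ...
    tensor = torch.as_tensor(batch)
    # ... and torch performs the explicit CPU -> GPU copy.
    tensor = tensor.to("cuda")
    return float(tensor.sum())

ray.init(address="auto")
data_ref = ray.put(np.ones((1024, 1024), dtype=np.float32))
print(ray.get(to_gpu.remote(data_ref)))
```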


Thanks @amogkam
So, by default, how many resources does the TorchTrainer reserve?

@amogkam and is it possible to have 1 CPU serve multiple GPUs?

Hey @valiantljk - by default TorchTrainer uses 1 CPU per worker, and if you are using GPUs, then 1 GPU per worker. But you can modify the CPUs per worker (even to fractional values) by setting num_cpus_per_worker.

Technically it is possible right now to have 1 CPU serve multiple GPUs if you pass a fractional value to num_cpus_per_worker. But I haven’t tried this and it might not fully work. I completely agree, though, that we want to add support for num_gpus_per_worker too.
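To make the fractional idea concrete, here is the resource arithmetic Ray would do (the helper function is just for illustration, not a Ray API): with 1 GPU per worker, four workers at 0.25 CPU each fit on a single CPU.

```python
def total_cpus_reserved(num_workers, num_cpus_per_worker):
    """Total CPU resource Ray would reserve for the training workers.

    Mirrors Ray's resource bookkeeping: each worker reserves
    `num_cpus_per_worker`, so the cluster needs the sum available.
    """
    return num_workers * num_cpus_per_worker

# Four GPU workers sharing a single CPU via fractional reservations:
print(total_cpus_reserved(num_workers=4, num_cpus_per_worker=0.25))  # -> 1.0
```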
