Can you copy data between two nodes' CPU and GPU?

Hi, I have a use case where one node in the Ray cluster only has GPU resources left and no CPU available (the CPUs are occupied by other pods outside the Ray cluster).
The question is: in the case of distributed RaySGD training, can Ray leverage this isolated GPU?

How is the data transfer between CPU and GPU handled by Ray? Can Ray copy the data from a remote CPU to this lonely GPU?

@amogkam do you know about this?

Thanks for the question @valiantljk, this is an interesting use case.

For the first question, if no CPUs are physically available then I don’t think you’d be able to execute any Python code. If it’s just that you don’t want Ray to reserve a CPU for the training worker, you can achieve this by setting num_cpus_per_worker to 0 in your TorchTrainer.
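For what it’s worth, here is a minimal configuration sketch of that second option. MyTrainingOperator is a placeholder for your own TrainingOperator subclass, and the worker count is arbitrary; the argument names follow the RaySGD TorchTrainer API:

```python
import ray
from ray.util.sgd.torch import TorchTrainer

ray.init(address="auto")  # connect to the existing Ray cluster

# Sketch only: MyTrainingOperator is a placeholder for your own
# TrainingOperator subclass. num_cpus_per_worker=0 tells Ray not to
# reserve any CPU resource for each training worker, so the worker
# can be scheduled on the node that only has GPUs free.
trainer = TorchTrainer(
    training_operator_cls=MyTrainingOperator,
    num_workers=2,
    use_gpu=True,
    num_cpus_per_worker=0,
)
```

Note that even with a reservation of 0, the worker process still executes on a physical CPU; the setting only affects Ray's scheduling bookkeeping.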

For the second question, you could use Ray to copy the data from a remote CPU to a local CPU, and then use some other library like torch to copy it to GPU. But Ray itself does not use GPU memory for the object store.
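As a sketch of that flow (this assumes a CUDA GPU is visible to the task and that the data is a NumPy array; the function name is made up for illustration): Ray ships the object into the GPU node's host memory, and torch does the final host-to-device copy.

```python
import numpy as np
import ray
import torch

@ray.remote(num_gpus=1, num_cpus=0)
def to_gpu(batch):
    # Ray has already copied `batch` from the remote object store into
    # this node's CPU memory (the object store lives in host RAM, never
    # in GPU memory) ...
    tensor = torch.as_tensor(batch)
    # ... and torch performs the explicit CPU -> GPU copy.
    tensor = tensor.to("cuda")
    return float(tensor.sum())

ray.init(address="auto")
data_ref = ray.put(np.ones((1024, 1024), dtype=np.float32))
print(ray.get(to_gpu.remote(data_ref)))
```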


Thanks @amogkam
So, by default, how many resources does the TorchTrainer reserve?

@amogkam and is it possible to have 1 CPU serve multiple GPUs?

Hey @valiantljk - by default TorchTrainer uses 1 CPU per worker, and if you are using GPUs, then 1 GPU per worker. But you can modify the CPUs per worker (even to fractional values) by setting num_cpus_per_worker.

Technically it is possible right now to have 1 CPU serve multiple GPUs if you pass a fractional value to num_cpus_per_worker. But I haven’t tried this and it might not fully work. I completely agree, though, that we want to add support for num_gpus_per_worker too.
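To make the fractional idea concrete, here is the resource arithmetic Ray would do (the helper function is just for illustration, not a Ray API): with 1 GPU per worker, four workers at 0.25 CPU each fit on a single CPU.

```python
def total_cpus_reserved(num_workers, num_cpus_per_worker):
    """Total CPU resource Ray would reserve for the training workers.

    Mirrors Ray's resource bookkeeping: each worker reserves
    `num_cpus_per_worker`, so the cluster needs the sum available.
    """
    return num_workers * num_cpus_per_worker

# Four GPU workers sharing a single CPU via fractional reservations:
print(total_cpus_reserved(num_workers=4, num_cpus_per_worker=0.25))  # -> 1.0
```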
