Hello,

On a server I have 5 GPUs. Suppose a single torch model does not fit on a single GPU, so I have to use `torch.nn.DataParallel`, which splits each batch across all available GPUs.
Now I want to do hyperparameter optimization with Ray Tune (`tune.run()`). Even though I wrap my torch model in `torch.nn.DataParallel`, Ray seems to ignore that and still assigns only 1 GPU per trial, which causes an out-of-memory error.
How can I distribute one single trial on multiple GPUs?
I tried `resources_per_trial={"gpu": 5}`, but with no success.
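For context, here is a stripped-down sketch of what I'm running (the model, dimensions, and config values are placeholders, not my real code):

```python
import torch
import torch.nn as nn
from ray import tune


def train_fn(config):
    # Placeholder model; the real one is far too large for a single GPU.
    model = nn.Linear(1024, 1024)
    # Wrap in DataParallel so each batch is split across the visible GPUs.
    model = nn.DataParallel(model.cuda())
    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
    for step in range(10):
        x = torch.randn(64, 1024).cuda()
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        tune.report(loss=loss.item())


analysis = tune.run(
    train_fn,
    config={"lr": tune.loguniform(1e-4, 1e-1)},
    # This is the part that does not seem to work:
    # requesting all 5 GPUs for a single trial.
    resources_per_trial={"gpu": 5},
)
```

Even with `resources_per_trial={"gpu": 5}`, the trial behaves as if only one GPU were usable and runs out of memory.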
Thanks.