Ray Tune for single-node distributed training in PyTorch

vblagoje · August 24, 2021, 8:55am

Hey guys,

I want to optimize hyperparameters for single-node distributed training using torch DDP. I start the training script with:

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
           YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other
           arguments of your training script)

where NUM_GPUS_YOU_HAVE is 8 (we train on 8 GPUs). Because the training node has 8 GPUs, I assume I can only run Tune trials sequentially, which is OK. However, I am not sure how to set the parameters of DistributedTrainableCreator to replicate the training setup outlined above. More specifically, what values do I set for num_workers, num_gpus_per_worker, and num_workers_per_host? My best guess was:

-n 1 --num-gpus-per-worker 8 --workers-per-node 1

But then I got a warning from DDP in the MNIST example:

(pid=254593) /home/vblagoje/miniconda3/envs/transformers_tt/lib/python3.7/site-packages/torch/nn/parallel/distributed.py:448: UserWarning: Single-Process Multi-GPU is not the recommended mode for DDP. In this mode, each DDP instance operates on multiple devices and creates multiple module replicas within one process. The overhead of scatter/gather and GIL contention in every forward pass can slow down training. Please consider using one DDP instance per device or per module replica by explicitly setting device_ids or CUDA_VISIBLE_DEVICES. 
(pid=254593)   "Single-Process Multi-GPU is not the recommended mode for "

So one process per GPU was not being launched, as DDP single-node multi-GPU training requires. When I tried num_workers 8 instead, Tune could not start because it requested 8 CPUs and the training machine has only 2. Regular training works because torch.distributed.launch assigns one process per GPU. How can I replicate single-node multi-GPU training in Tune?

Thanks,
Vladimir

amogkam · August 24, 2021, 4:14pm

Hey @vblagoje, yeah setting num_workers to 8 and num_gpus_per_worker to 1 is right. That’s equivalent to 8 processes. Since you only have 2 CPUs, you can set num_cpus_per_worker to 0.25, and this will allow you to have 8 workers.
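
To make the arithmetic behind this concrete, here is a minimal sketch. It is not Ray's scheduler, only the multiplication of per-worker requests by the worker count; the helper names `total_request` and `fits` are hypothetical, and the node size (8 GPUs, 2 CPUs) is taken from this thread.

```python
def total_request(num_workers, num_gpus_per_worker, num_cpus_per_worker):
    """Total (GPUs, CPUs) one trial asks for: workers x per-worker request."""
    return (num_workers * num_gpus_per_worker,
            num_workers * num_cpus_per_worker)


def fits(request, node_gpus=8, node_cpus=2):
    """True if the total request fits on a node with the given resources."""
    gpus, cpus = request
    return gpus <= node_gpus and cpus <= node_cpus


# 8 workers with 1 GPU and 1 CPU each: 8 CPUs requested, only 2 available.
print(fits(total_request(8, 1, 1)))      # False
# 8 workers with 1 GPU and 0.25 CPU each: 8 GPUs and 2 CPUs in total.
print(fits(total_request(8, 1, 0.25)))   # True
# 8 workers with 8 GPUs each would ask for 64 GPUs.
print(total_request(8, 8, 0.25)[0])      # 64
```

The per-worker values are multiplied by num_workers, so the GPU and CPU settings describe a single worker process, not the whole trial.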

vblagoje · August 24, 2021, 4:49pm

Hey @amogkam, I'm still a bit confused. If I run the MNIST DDP example with:

python ddp_mnist_torch.py --num-workers 8 --num-gpus-per-worker 8

I get the exception that I don’t have 64 GPUs. However:

python ddp_mnist_torch.py --num-workers 8 --num-gpus-per-worker 1

now works. I realized that the limitation of 2 CPUs comes from the ddp_mnist_torch.py example itself, so I removed that limitation.

My question: is this last setting equivalent to the following?

python -m torch.distributed.launch --nproc_per_node=8 ddp_mnist_torch.py

Thanks,
Vladimir

amogkam · August 24, 2021, 5:01pm

Ah sorry I made a typo in my response. I meant to say num_gpus_per_worker should be 1.

Yes, what you have now is equivalent! num_workers is the same as the number of processes.
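
As a rough sketch of that mapping, the helper below turns a `--nproc_per_node` value into the keyword arguments discussed in this thread. The helper names are hypothetical, and the import path `ray.tune.integration.torch` is an assumption based on the Ray 1.x releases current at the time of this thread; check it against your installed version, since the API may have moved or been removed since.

```python
def launch_equivalent_kwargs(nproc_per_node, node_cpus):
    """Keyword arguments that mirror `--nproc_per_node` (one process per GPU)."""
    return {
        "num_workers": nproc_per_node,          # one worker process per GPU
        "num_gpus_per_worker": 1,               # each process owns one GPU
        # Split the node's CPUs evenly so all workers can be scheduled.
        "num_cpus_per_worker": node_cpus / nproc_per_node,
    }


def make_trainable(train_func, nproc_per_node=8, node_cpus=2):
    """Wrap a training function; needs Ray installed (import path is an assumption)."""
    from ray.tune.integration.torch import DistributedTrainableCreator
    return DistributedTrainableCreator(
        train_func, **launch_equivalent_kwargs(nproc_per_node, node_cpus)
    )


print(launch_equivalent_kwargs(8, 2))
# {'num_workers': 8, 'num_gpus_per_worker': 1, 'num_cpus_per_worker': 0.25}
```

The result of `make_trainable` would then be passed to `tune.run` in place of the plain training function. With all 8 GPUs claimed by one trial, trials run one at a time on this node.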
