Ray Tune for single-node distributed training in PyTorch

Hey guys,

I want to optimize hyperparameters for single-node distributed training using PyTorch DDP. I start the training script with:

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
           YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other
           arguments of your training script)

where NUM_GPUS_YOU_HAVE is 8 (we use 8 GPUs in training). Now, because I have 8 GPUs on that training node, I assume I can only run Tune trials sequentially, which is OK. However, I am not sure how to set the parameters for DistributedTrainableCreator in order to replicate the type of training outlined above. More specifically, what values do I set for num_workers, num_gpus_per_worker, and num_workers_per_host? My best guess was:

-n 1 --num-gpus-per-worker 8 --workers-per-node 1

But then I got this warning from DDP in the MNIST example:

(pid=254593) /home/vblagoje/miniconda3/envs/transformers_tt/lib/python3.7/site-packages/torch/nn/parallel/distributed.py:448: UserWarning: Single-Process Multi-GPU is not the recommended mode for DDP. In this mode, each DDP instance operates on multiple devices and creates multiple module replicas within one process. The overhead of scatter/gather and GIL contention in every forward pass can slow down training. Please consider using one DDP instance per device or per module replica by explicitly setting device_ids or CUDA_VISIBLE_DEVICES. 
(pid=254593)   "Single-Process Multi-GPU is not the recommended mode for "

In other words, instead of launching one process per GPU, as DDP single-node multi-GPU training recommends, a single process was driving all the GPUs. If I instead set num_workers to 8, Tune could not start, because it requested 8 CPUs and the training machine has only 2. The regular training works because torch.distributed.launch assigns one process per GPU. How can I replicate single-node multi-GPU training in Tune?
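For anyone hitting the same wall, here is a toy model of the resource math I eventually pieced together (this is my illustration of how a trial's request scales with the worker count, not the actual Ray scheduler code; the default of 1 CPU per worker is an assumption):

```python
# Toy model: a Tune trial requests num_workers times the per-worker resources.
# The function and default are illustrative, not Ray internals.
def trial_request(num_workers, gpus_per_worker, cpus_per_worker=1):
    return {"CPU": num_workers * cpus_per_worker,
            "GPU": num_workers * gpus_per_worker}

print(trial_request(8, 8))   # {'CPU': 8, 'GPU': 64} -- would need 64 GPUs
print(trial_request(8, 1))   # {'CPU': 8, 'GPU': 8}  -- still 8 CPUs on a 2-CPU box
print(trial_request(8, 1, cpus_per_worker=0.25))  # {'CPU': 2.0, 'GPU': 8} -- fits
```

So the trial only schedules once the per-worker CPU share is small enough that the totals fit on the node.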


Hey @vblagoje, yeah, setting num_workers to 8 and num_gpus_per_worker to 1 is right. That's equivalent to 8 processes. Since you only have 2 CPUs, you can set num_cpus_per_worker to 0.25, which will allow you to run 8 workers.
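To spell out the translation from the torch.distributed.launch invocation to these settings, here is a small sketch (the helper function tune_params_for_launch is made up for illustration; only the three parameter names come from DistributedTrainableCreator as discussed in this thread):

```python
# Hypothetical helper: map a torch.distributed.launch invocation onto the
# DistributedTrainableCreator arguments discussed above -- one Tune worker per
# DDP process, one GPU each, and the node's CPUs split evenly across workers.
def tune_params_for_launch(nproc_per_node, total_cpus):
    return {
        "num_workers": nproc_per_node,       # one worker per DDP process
        "num_gpus_per_worker": 1,            # one GPU per process
        "num_cpus_per_worker": total_cpus / nproc_per_node,
    }

print(tune_params_for_launch(8, 2))
# {'num_workers': 8, 'num_gpus_per_worker': 1, 'num_cpus_per_worker': 0.25}
```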

Hey @amogkam, I'm still a bit confused. If I run the MNIST DDP example with:

python ddp_mnist_torch.py --num-workers 8 --num-gpus-per-worker 8 

I get an exception saying I don't have 64 GPUs. However:

python ddp_mnist_torch.py --num-workers 8 --num-gpus-per-worker 1 

now works. I also realized that the 2-CPU limitation comes from the ddp_mnist_torch.py example itself, so I removed it.

So my question is: is this last setting equivalent to the following?

python -m torch.distributed.launch --nproc_per_node=8 ddp_mnist_torch.py



Ah sorry I made a typo in my response. I meant to say num_gpus_per_worker should be 1.

Yes, what you have now is equivalent! num_workers is the same as the number of processes.
