I want to optimize hyperparameters for single-node distributed training with PyTorch DDP. I start the training script with:
python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other arguments of your training script)
where NUM_GPUS_YOU_HAVE is 8 (we train on 8 GPUs). Because the training node has only those 8 GPUs, I assume I can only run Tune trials sequentially, which is fine. However, I am not sure how to set the parameters of DistributedTrainableCreator to replicate the training setup outlined above. More specifically, what values should I set for num_workers, num_gpus_per_worker, and num_workers_per_host? My best guess was:
-n 1 --num-gpus-per-worker 8 --workers-per-node 1
But then I got the following warning from DDP when running the MNIST example:
(pid=254593) /home/vblagoje/miniconda3/envs/transformers_tt/lib/python3.7/site-packages/torch/nn/parallel/distributed.py:448: UserWarning: Single-Process Multi-GPU is not the recommended mode for DDP. In this mode, each DDP instance operates on multiple devices and creates multiple module replicas within one process. The overhead of scatter/gather and GIL contention in every forward pass can slow down training. Please consider using one DDP instance per device or per module replica by explicitly setting device_ids or CUDA_VISIBLE_DEVICES.
(pid=254593)   "Single-Process Multi-GPU is not the recommended mode for "
So a single process was driving all 8 GPUs, instead of one process per GPU as DDP single-node multi-GPU training expects. When I instead set the number of workers to 8, Tune could not start the trial because it requested 8 CPUs and the training machine has only 2. Regular training works because torch.distributed.launch starts one process per GPU. How can I replicate single-node multi-GPU training in Tune?
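To make the resource arithmetic behind the two attempts explicit, here is a minimal sketch in plain Python. It does not call Ray or PyTorch. The Layout class and its methods are hypothetical names used only for illustration, and the default of one CPU per worker is an assumption inferred from the "requested 8 CPUs" behaviour described above, not a confirmed library default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    """Hypothetical description of the workers requested for one trial."""
    num_workers: int
    num_gpus_per_worker: int
    num_cpus_per_worker: float = 1.0  # assumption: one CPU per worker

    def total_gpus(self) -> int:
        return self.num_workers * self.num_gpus_per_worker

    def total_cpus(self) -> float:
        return self.num_workers * self.num_cpus_per_worker

    def fits(self, node_cpus: float, node_gpus: int) -> bool:
        # A trial can only be scheduled if the node covers both totals.
        return self.total_cpus() <= node_cpus and self.total_gpus() <= node_gpus

    def one_process_per_gpu(self) -> bool:
        # The layout torch.distributed.launch produces: each worker owns one GPU.
        return self.num_gpus_per_worker == 1


# The machine from the question: 2 CPUs, 8 GPUs.
NODE_CPUS, NODE_GPUS = 2, 8

# First attempt: -n 1 --num-gpus-per-worker 8 --workers-per-node 1
first_attempt = Layout(num_workers=1, num_gpus_per_worker=8)
# Second attempt: 8 workers with one GPU each
second_attempt = Layout(num_workers=8, num_gpus_per_worker=1)

for name, layout in [("first", first_attempt), ("second", second_attempt)]:
    print(
        name,
        "cpus:", layout.total_cpus(),
        "gpus:", layout.total_gpus(),
        "fits:", layout.fits(NODE_CPUS, NODE_GPUS),
        "one process per GPU:", layout.one_process_per_gpu(),
    )
```

Under these assumptions, the first layout fits the node but puts all 8 GPUs in one process (hence the DDP warning), while the second has the right process-per-GPU shape but asks for more CPUs than the node has.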