- Medium: It contributes significant difficulty to completing my task, but I can work around it.
Hey guys, correct me if I'm wrong.
During distributed training, before the optimization step, we aggregate the gradients from all GPU workers onto one GPU on the head node (let's call it the main GPU), average them there, and send the result back to every worker. As a consequence, the main GPU's memory consumption will be higher than that of the rest of the workers.
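Roughly the pattern I have in mind, just as a sketch using plain torch.distributed primitives (not what Ray actually runs under the hood; it assumes the process group is already initialized):

```python
import torch
import torch.distributed as dist

def average_gradients_via_main_gpu(model: torch.nn.Module) -> None:
    """Sketch of the scheme described above: sum every worker's gradients
    onto rank 0 (the "main GPU"), average them there, then broadcast the
    averaged gradients back to all workers before optimizer.step()."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is None:
            continue
        # Sum the gradients from all workers onto rank 0.
        dist.reduce(param.grad, dst=0, op=dist.ReduceOp.SUM)
        if dist.get_rank() == 0:
            param.grad /= world_size
        # Send the averaged gradient back to every worker.
        dist.broadcast(param.grad, src=0)
```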
Here is my situation: if I train the model on a single instance (4x V100), I can fit a training batch of size 2 on each worker. But if I train the model on multiple instances via Ray (4x 4x V100), I cannot use a batch size of 2, because the main GPU hits an OOM error due to its higher memory consumption. So I have to set batch size = 1 and end up with under-utilization of the workers (in terms of memory).
Question: are there any workarounds or best practices for this situation?
P.S. Here is an example of the main GPU's memory over-consumption and the resulting under-utilization of the other GPUs (in terms of memory).
Hey @Mykyta_Alekseiev, what Ray/training libraries are you using?
Hey @matthewdeng , thanks for your attention!
I am using RaySGDv1 with torch and the NCCL backend.
UPD: Ray version 1.12.1
I ran the tutorial scripts on my cluster with both the old and the new version of RaySGD and got different GPU memory utilization patterns:
- Getting Started with Distributed Machine Learning with PyTorch and Ray | by Michael Galarnyk | Distributed Computing with Ray | Medium (RaySGDv1)
- train_fashion_mnist_example — Ray 1.12.1
So I have two questions:
- Am I right that RaySGDv1 uses DataParallel? We can see that a single GPU on the head node uses more memory than the other GPU workers;
- Why, when using RaySGDv2, isn't memory distributed uniformly across all worker GPUs in the cluster? Is that normal behaviour?
Both are using DistributedDataParallel (RaySGDv1 is deprecated).
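With DDP every worker holds its own model replica and gradients are averaged via an all-reduce during backward(), so in principle no single GPU should need more memory than the others. A minimal sketch of what each worker runs (not the exact RaySGD code; it assumes the process group and local rank are already set up by the launcher):

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def per_worker_step(local_rank: int) -> None:
    # Assumes torch.distributed.init_process_group(...) has already been
    # called for this worker (RaySGD / Ray Train does that for you).
    device = torch.device(f"cuda:{local_rank}")
    torch.cuda.set_device(device)

    model = torch.nn.Linear(10, 1).to(device)
    ddp_model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    x = torch.randn(2, 10, device=device)   # this worker's batch
    loss = ddp_model(x).sum()
    loss.backward()                          # gradient all-reduce happens here
    optimizer.step()
```

So the imbalance you're seeing shouldn't come from the gradient averaging itself.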
I would say this behavior is a bit unexpected. My guess is that PyTorch is defaulting to loading additional data onto GPU 0, but it's not clear why.
Can you try printing ray.train.torch.get_device() at the start of the training function? This will show the device ID that the dataloader and model are being moved to.
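For example, something like this at the top of the training function (just a sketch; the world_rank() print is only there to label the output):

```python
import ray.train
import ray.train.torch

def train_func(config):
    # Print which CUDA device this worker was assigned.
    device = ray.train.torch.get_device()
    print(f"worker {ray.train.world_rank()} -> {device}")
    # ... move the model and batches to `device` and continue training as usual ...
```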
This prints all possible device IDs:
(BaseWorkerMixin pid=27832) cuda:1
(BaseWorkerMixin pid=27831) cuda:0
(BaseWorkerMixin pid=23773, ip=10.140.0.81) cuda:0
(BaseWorkerMixin pid=23776, ip=10.140.0.81) cuda:1
Moreover, I ran a new training run with more workers and captured the nvidia-smi output. We can see that some additional processes are taking memory on GPU [0].
Full process names:
Here is additional output from the dashboard:
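For reference, a quick way to list which PIDs hold memory on each GPU and match them against the BaseWorkerMixin PIDs above (a sketch, relying on nvidia-smi's CSV query interface):

```python
import subprocess

def gpu_compute_processes():
    """Return (gpu_uuid, pid, used_memory) for every process that currently
    holds GPU memory, so the PIDs can be matched against the Ray worker PIDs."""
    out = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-compute-apps=gpu_uuid,pid,used_memory",
            "--format=csv,noheader",
        ],
        text=True,
    )
    return [tuple(part.strip() for part in line.split(","))
            for line in out.strip().splitlines()]

if __name__ == "__main__":
    for gpu_uuid, pid, used_memory in gpu_compute_processes():
        print(gpu_uuid, pid, used_memory)
```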
Hey @Mykyta_Alekseiev, for the most recent screenshots that you posted, what code are you running? Is this still the train_fashion_mnist_example — Ray 1.12.1 with no modifications? What flags are you providing when you run the script?
Also, what PyTorch version are you using?