- Medium: It contributes to significant difficulty to complete my task, but I can work around it.
Hey guys, correct me if I am wrong.
During distributed training, before the optimization step, we aggregate the gradients from all GPU workers on one GPU worker of the head node (let's call it the main GPU), average them there, and send the result back to every worker. Thus the main GPU's memory consumption will be higher than that of the other workers.
I have the following situation: if I train the model on a single instance (4xV100), I can fit a training batch of size 2 on each worker. But if I train the model on multiple instances via Ray (4x4xV100), I cannot use a batch size of 2, because the main GPU hits an OOM error due to its higher memory consumption. So I have to set batch size = 1, which leaves the workers under-utilized (in terms of memory).
Question: Is there any workaround or best practice for this situation?
P.S. This is an example of the main GPU's memory over-consumption and the resulting overall GPU under-utilization (in terms of memory).
Hey @Mykyta_Alekseiev, what Ray/training libraries are you using?
Hey @matthewdeng , thanks for your attention!
I am using RaySGDv1 with torch and nccl backend.
ray version: 1.12.1
I ran the scripts from the tutorials on my cluster with the old and the new version of RaySGD and got different GPU memory utilization patterns.
Getting Started with Distributed Machine Learning with PyTorch and Ray | by Michael Galarnyk | Distributed Computing with Ray | Medium (RaySGDv1)
train_fashion_mnist_example — Ray 1.12.1
So I have two questions:
- Am I right that RaySGDv1 uses DataParallel? We can see that one GPU on the head node uses more memory than the other GPU workers.
- Why don't we get a uniform memory distribution across all worker GPUs in the cluster when using RaySGDv2? Is this normal behaviour?
Both are using DistributedDataParallel (RaySGDv1 is deprecated).
I would say this behavior is a bit unexpected - my guess is that PyTorch defaults to additional loading of data into GPU 0, but it’s not clear why.
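For context on why DistributedDataParallel by itself should not need a "main GPU": with the NCCL backend, gradients are averaged with an all-reduce, so no single worker collects everyone's gradients. The sketch below is a toy, pure-Python simulation of ring all-reduce (one common all-reduce algorithm), not the actual NCCL or PyTorch implementation; the function name `ring_all_reduce_mean` is made up for illustration.

```python
from typing import List


def ring_all_reduce_mean(grads: List[List[float]]) -> List[List[float]]:
    """Simulate ring all-reduce: every worker ends with the mean gradient.

    grads[r] is the local gradient of worker r; all have the same length.
    Each worker only ever holds one gradient-sized buffer.
    """
    n = len(grads)
    size = len(grads[0])
    bufs = [list(g) for g in grads]  # one buffer per simulated worker
    # Chunk c covers indices bounds[c]:bounds[c + 1].
    bounds = [c * size // n for c in range(n + 1)]

    def chunk(rank: int, c: int) -> List[float]:
        return bufs[rank][bounds[c]:bounds[c + 1]]

    # Phase 1, reduce-scatter: worker r sends chunk (r - step) to worker r + 1,
    # which adds it to its own copy. Afterwards worker r holds the full sum
    # of chunk (r + 1) % n.
    for step in range(n - 1):
        outgoing = [((r - step) % n, chunk(r, (r - step) % n)) for r in range(n)]
        for r, (c, data) in enumerate(outgoing):
            dst = (r + 1) % n
            for i, v in enumerate(data):
                bufs[dst][bounds[c] + i] += v

    # Phase 2, all-gather: completed chunks travel around the ring and
    # overwrite the partial sums on each receiver.
    for step in range(n - 1):
        outgoing = [((r + 1 - step) % n, chunk(r, (r + 1 - step) % n)) for r in range(n)]
        for r, (c, data) in enumerate(outgoing):
            dst = (r + 1) % n
            bufs[dst][bounds[c]:bounds[c + 1]] = data

    # Turn the sums into means.
    return [[v / n for v in buf] for buf in bufs]
```

In this simulation every worker sends and receives the same amount of data and finishes with an identical averaged gradient, so the averaging step alone would not explain one GPU using more memory than the others.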
Can you try printing ray.train.torch.get_device() at the start of the training function? This will show the device ID that the dataloader and model are being moved to.
This prints all possible device IDs:
(BaseWorkerMixin pid=27832) cuda:1
(BaseWorkerMixin pid=27831) cuda:0
(BaseWorkerMixin pid=23773, ip=10.140.0.81) cuda:0
(BaseWorkerMixin pid=23776, ip=10.140.0.81) cuda:1
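To read this output per node, the lines can be grouped by IP address. The snippet below is a small sketch that parses exactly the four log lines above; the helper names are hypothetical, and labelling lines without an `ip=` field as "local" assumes they come from the node where the driver script runs.

```python
import re
from collections import defaultdict

LOG = """\
(BaseWorkerMixin pid=27832) cuda:1
(BaseWorkerMixin pid=27831) cuda:0
(BaseWorkerMixin pid=23773, ip=10.140.0.81) cuda:0
(BaseWorkerMixin pid=23776, ip=10.140.0.81) cuda:1
"""

# Matches "(Name pid=123) cuda:0" and "(Name pid=123, ip=1.2.3.4) cuda:0".
LINE_RE = re.compile(r"\(\w+ pid=(\d+)(?:, ip=([\d.]+))?\)\s+(cuda:\d+)")


def devices_by_node(log_text):
    """Return {node: {pid: device}} for every worker line in the log."""
    nodes = defaultdict(dict)
    for match in LINE_RE.finditer(log_text):
        pid, ip, device = match.groups()
        # Assumption: a missing ip means the worker is on the driver's node.
        nodes[ip or "local"][int(pid)] = device
    return dict(nodes)


def has_device_collision(nodes):
    """True if two workers on the same node report the same device."""
    return any(len(set(devs.values())) < len(devs) for devs in nodes.values())


nodes = devices_by_node(LOG)
for node, devs in nodes.items():
    print(node, sorted(devs.values()))
print("collision:", has_device_collision(nodes))
```

For the log above, each of the two nodes has one worker on cuda:0 and one on cuda:1, so the workers themselves are not doubling up on a device.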
Moreover, I ran a new training with more workers and captured the nvidia-smi output. We can see that some additional processes are taking memory on the GPU.
Full process names:
Here is additional output from the dashboard:
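The same per-process information can also be collected as text, which is easier to compare across nodes than screenshots. This is a rough sketch that wraps nvidia-smi's `--query-compute-apps` CSV output; the function names are hypothetical, and it assumes nvidia-smi is on the PATH of the node where `query_compute_apps()` is called.

```python
import csv
import io
import subprocess
from collections import defaultdict

NVIDIA_SMI_CMD = [
    "nvidia-smi",
    "--query-compute-apps=gpu_uuid,pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Group nvidia-smi compute-app rows by GPU.

    Returns {gpu_uuid: [(pid, process_name, used_mib_or_None), ...]}.
    """
    per_gpu = defaultdict(list)
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 4:
            continue  # skip blank or unexpected lines
        gpu_uuid, pid, name, used = (field.strip() for field in row)
        try:
            used_mib = int(used)
        except ValueError:
            used_mib = None  # e.g. memory usage not reported by the driver
        per_gpu[gpu_uuid].append((int(pid), name, used_mib))
    return dict(per_gpu)


def query_compute_apps():
    """Run nvidia-smi on this node and parse its output."""
    result = subprocess.run(
        NVIDIA_SMI_CMD, capture_output=True, text=True, check=True
    )
    return parse_compute_apps(result.stdout)
```

Calling `query_compute_apps()` on each node and looking for GPUs that list more than one PID shows which processes hold memory on a device in addition to the worker assigned to it.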
Hey @Mykyta_Alekseiev, for the most recent screenshots that you posted, what code are you running? Is this still the train_fashion_mnist_example — Ray 1.12.1 with no modifications? What flags are you providing when you run the script?
Also, what pytorch version are you using?