- Medium: It contributes significant difficulty to completing my task, but I can work around it.
Hey guys, correct me if I'm wrong.
During distributed training, before the optimization step, we aggregate the gradients from all GPU workers onto one GPU on the head node (let's call it the main GPU), average them there, and send the result back to every worker. As a consequence, the main GPU's memory consumption will be higher than that of the rest of the workers.
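Roughly the pattern I have in mind, just as a sketch using plain torch.distributed primitives (not what Ray actually runs under the hood; it assumes the process group is already initialized):

```python
import torch
import torch.distributed as dist

def average_gradients_via_main_gpu(model: torch.nn.Module) -> None:
    """Sketch of the scheme described above: sum every worker's gradients
    onto rank 0 (the "main GPU"), average them there, then broadcast the
    averaged gradients back to all workers before optimizer.step()."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is None:
            continue
        # Sum the gradients from all workers onto rank 0.
        dist.reduce(param.grad, dst=0, op=dist.ReduceOp.SUM)
        if dist.get_rank() == 0:
            param.grad /= world_size
        # Send the averaged gradient back to every worker.
        dist.broadcast(param.grad, src=0)
```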
Here is my situation: if I train the model on a single instance (4x V100), I can fit a training batch of size 2 on each worker. But if I train the model on multiple instances via Ray (4x 4x V100), I cannot use a batch size of 2, because the main GPU hits an OOM error due to its higher memory consumption. So I have to set batch size = 1 and end up with under-utilization of the workers (in terms of memory).
Question: are there any workarounds or best practices for this situation?
P.S. Here is an example of the main GPU's memory over-consumption and the resulting under-utilization of the other GPUs (in terms of memory).
Hey @Mykyta_Alekseiev, what Ray/training libraries are you using?
Hey @matthewdeng , thanks for your attention!
I am using RaySGDv1 with torch and the NCCL backend.
UPD: Ray version 1.12.1
I ran the tutorial scripts on my cluster with both the old and the new version of RaySGD and got different GPU memory utilization patterns:
- Getting Started with Distributed Machine Learning with PyTorch and Ray | by Michael Galarnyk | Distributed Computing with Ray | Medium (RaySGDv1)
- train_fashion_mnist_example — Ray 1.12.1
So I have two questions:
- Am I right that RaySGDv1 uses DataParallel? We can see that a single GPU on the head node uses more memory than the other GPU workers;
- Why, when using RaySGDv2, isn't memory distributed uniformly across all worker GPUs in the cluster? Is that normal behaviour?
Both are using DistributedDataParallel (RaySGDv1 is deprecated).
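With DDP every worker holds its own model replica and gradients are averaged via an all-reduce during backward(), so in principle no single GPU should need more memory than the others. A minimal sketch of what each worker runs (not the exact RaySGD code; it assumes the process group and local rank are already set up by the launcher):

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def per_worker_step(local_rank: int) -> None:
    # Assumes torch.distributed.init_process_group(...) has already been
    # called for this worker (RaySGD / Ray Train does that for you).
    device = torch.device(f"cuda:{local_rank}")
    torch.cuda.set_device(device)

    model = torch.nn.Linear(10, 1).to(device)
    ddp_model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    x = torch.randn(2, 10, device=device)   # this worker's batch
    loss = ddp_model(x).sum()
    loss.backward()                          # gradient all-reduce happens here
    optimizer.step()
```

So the imbalance you're seeing shouldn't come from the gradient averaging itself.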
I would say this behavior is a bit unexpected. My guess is that PyTorch is defaulting to loading additional data onto GPU 0, but it's not clear why.
Can you try printing ray.train.torch.get_device() at the start of the training function? This will show the device ID that the dataloader and model are being moved to.
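For example, something like this at the top of the training function (just a sketch; the world_rank() print is only there to label the output):

```python
import ray.train
import ray.train.torch

def train_func(config):
    # Print which CUDA device this worker was assigned.
    device = ray.train.torch.get_device()
    print(f"worker {ray.train.world_rank()} -> {device}")
    # ... move the model and batches to `device` and continue training as usual ...
```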
This prints all possible device IDs:
(BaseWorkerMixin pid=27832) cuda:1
(BaseWorkerMixin pid=27831) cuda:0
(BaseWorkerMixin pid=23773, ip=10.140.0.81) cuda:0
(BaseWorkerMixin pid=23776, ip=10.140.0.81) cuda:1
Moreover, I ran a new training run with more workers and captured the nvidia-smi output. We can see that some additional processes are taking memory on GPU [0].
Full process names:
Here is additional output from the dashboard:
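For reference, a quick way to list which PIDs hold memory on each GPU and match them against the BaseWorkerMixin PIDs above (a sketch, relying on nvidia-smi's CSV query interface):

```python
import subprocess

def gpu_compute_processes():
    """Return (gpu_uuid, pid, used_memory) for every process that currently
    holds GPU memory, so the PIDs can be matched against the Ray worker PIDs."""
    out = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-compute-apps=gpu_uuid,pid,used_memory",
            "--format=csv,noheader",
        ],
        text=True,
    )
    return [tuple(part.strip() for part in line.split(","))
            for line in out.strip().splitlines()]

if __name__ == "__main__":
    for gpu_uuid, pid, used_memory in gpu_compute_processes():
        print(gpu_uuid, pid, used_memory)
```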
Hey @Mykyta_Alekseiev, for the most recent screenshots that you posted, what code are you running? Is this still the train_fashion_mnist_example — Ray 1.12.1 with no modifications? What flags are you providing when you run the script?
Also, what PyTorch version are you using?