I am tuning the hyper-parameters of the ResNet-50 model.
Previously, the peak GPU memory consumption with batch_size=256 was around 8 GB when running without Ray.
Now I am running 8 trials on 8 GPUs (1 GPU per trial), but each trial consumes 14 GB even with batch_size=128. Is there some model parallelism running under the hood that causes this GPU consumption, or is it something else?
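For context, the trials are launched roughly like this (a minimal sketch, assuming Ray Tune with one GPU reserved per trial; `train_resnet` and the `param_space` contents are placeholders, not my exact script):

```python
from ray import tune

def train_resnet(config):
    # Placeholder for the actual ResNet-50 training loop; with
    # {"gpu": 1} reserved, Ray sets CUDA_VISIBLE_DEVICES so each
    # trial sees exactly one GPU.
    ...

tuner = tune.Tuner(
    # Reserve 1 GPU per trial, so 8 concurrent trials fill 8 GPUs.
    tune.with_resources(train_resnet, resources={"gpu": 1}),
    param_space={"batch_size": 128},
    tune_config=tune.TuneConfig(num_samples=8),
)
tuner.fit()
```

With this setup, each trial is an independent single-GPU worker, so I would not expect any model parallelism unless the training function itself sets it up.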