Possibility of extra GPU memory consumption with Ray Tune

I am tuning the hyperparameters of a ResNet-50 model.

Previously, the maximum GPU memory consumption with batch_size=256 was around 8 GB when running without Ray.

Now I am running 8 trials on 8 GPUs (one GPU per trial), but each trial consumes 14 GB with batch_size=128. Is there model parallelism running under the hood causing this extra GPU memory consumption, or is it something else?

Hey @amztc34283,

There’s no model parallelism happening by default in Ray Tune. Perhaps you can use a profiler like torch.profiler (torch.profiler — PyTorch 2.1 documentation) to diagnose the extra GPU memory consumption?
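As a starting point, here is a minimal sketch of profiling memory per operator with torch.profiler. The tiny Linear model and random batch are stand-ins for your ResNet-50 training step; swap in one real train step of your trial. On a GPU you would also pass ProfilerActivity.CUDA and sort by "cuda_memory_usage" instead:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Stand-in for the real model and batch inside one Ray Tune trial.
model = torch.nn.Linear(512, 512)
batch = torch.randn(128, 512)

# On a GPU trial, also append ProfilerActivity.CUDA and move
# model/batch to .cuda() before profiling.
activities = [ProfilerActivity.CPU]

with profile(activities=activities, profile_memory=True) as prof:
    # One forward/backward step, i.e. what a trial runs repeatedly.
    out = model(batch)
    out.sum().backward()

# Per-operator memory table; use sort_by="cuda_memory_usage" on GPU.
print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=5))
```

Comparing this table between the Ray Tune run and the standalone run should show whether the extra memory comes from the model itself or from something outside it (e.g. CUDA context or caching allocator state in each trial process).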
