Auto-finding batch size in Lightning not working with Tune

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I have a small codebase that uses PyTorch Lightning. When not using Ray Tune, I can automatically find the largest batch size that fits in memory by passing the auto_scale_batch_size argument to the Lightning Trainer and calling trainer.tune. However, when I use Ray Tune with the same code, trainer.fit always raises torch.cuda.OutOfMemoryError after trainer.tune has run. If I remove the call to trainer.tune and pick a fixed batch size, everything works.
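
For reference, this is roughly what the working, non-Tune setup looks like. It is a minimal sketch, not my actual code; the model and dataloader are placeholders:

```python
import pytorch_lightning as pl
import torch
from torch.utils.data import DataLoader, TensorDataset


class MyModel(pl.LightningModule):
    def __init__(self, batch_size: int = 32):
        super().__init__()
        # Lightning's batch-size finder tunes this attribute in place.
        self.batch_size = batch_size
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())

    def train_dataloader(self):
        data = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))
        return DataLoader(data, batch_size=self.batch_size)


model = MyModel()
trainer = pl.Trainer(accelerator="gpu", devices=1, auto_scale_batch_size="power")
trainer.tune(model)  # finds the largest batch size that fits on the GPU
trainer.fit(model)   # works fine outside of Ray Tune
```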

  • This problem happens both when I use a single node (the head node) and when I use a cluster with multiple machines.
  • I am not running distributed training; I train a single model on each GPU.
  • I am using ray.tune.integration.pytorch_lightning, not ray_lightning (see the sketch after this list).
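
The failing Tune version looks roughly like the sketch below. Again, this is a simplified stand-in: train_fn, the empty search space, and the metric mapping in TuneReportCallback are placeholders, and MyModel is the LightningModule from the sketch above.

```python
from ray import tune
from ray.tune.integration.pytorch_lightning import TuneReportCallback
import pytorch_lightning as pl


def train_fn(config):
    model = MyModel()  # same LightningModule as in the sketch above
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=1,
        auto_scale_batch_size="power",
        callbacks=[TuneReportCallback({"loss": "train_loss"}, on="train_end")],
    )
    trainer.tune(model)  # the batch-size finder runs and reports a batch size
    trainer.fit(model)   # ...but this raises torch.cuda.OutOfMemoryError


tune.run(
    train_fn,
    config={},                       # placeholder search space
    resources_per_trial={"gpu": 1},  # one GPU per trial, no distributed training
)
```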

Does anyone know what causes this? I am posting here because it seems related to Ray and how it allocates resources.

This feature would be useful to me because my cluster has GPUs with different amounts of memory, and I also try different network topologies that work best with different batch sizes.

Hi @Doug,

A few questions:

  • Do you have a minimal reproduction script I could run?
  • What’s your hardware (what kind of GPU, how much VRAM)?
  • How large is your model?
  • What version of Ray are you on?

I ended up refactoring my project and the issue disappeared, so I don’t believe it was a problem with Ray but rather with my own software design.

I’d consider this solved, if that’s OK with you @justinvyu.