Auto-finding batch size in Lightning not working with Tune

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I have a small codebase that uses PyTorch Lightning. When not using Ray Tune, I can automatically find the largest batch size that fits in memory by passing the auto_scale_batch_size argument to the Lightning Trainer and calling trainer.tune. However, when I use Ray Tune with the same code, trainer.fit always raises torch.cuda.OutOfMemoryError after trainer.tune has run. If I remove the call to trainer.tune and pick a fixed batch size, everything works.
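
For reference, this is roughly what the working, non-Tune setup looks like. It is a minimal sketch, not my actual code; the model and dataloader are placeholders:

```python
import pytorch_lightning as pl
import torch
from torch.utils.data import DataLoader, TensorDataset


class MyModel(pl.LightningModule):
    def __init__(self, batch_size: int = 32):
        super().__init__()
        # Lightning's batch-size finder tunes this attribute in place.
        self.batch_size = batch_size
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())

    def train_dataloader(self):
        data = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))
        return DataLoader(data, batch_size=self.batch_size)


model = MyModel()
trainer = pl.Trainer(accelerator="gpu", devices=1, auto_scale_batch_size="power")
trainer.tune(model)  # finds the largest batch size that fits on the GPU
trainer.fit(model)   # works fine outside of Ray Tune
```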

  • This problem happens both when I use a single node (the head node) and when I use a cluster with multiple machines.
  • I am not running distributed training; I train a single model on each GPU.
  • I am using ray.tune.integration.pytorch_lightning, not ray_lightning (see the sketch after this list).
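
The failing Tune version looks roughly like the sketch below. Again, this is a simplified stand-in: train_fn, the empty search space, and the metric mapping in TuneReportCallback are placeholders, and MyModel is the LightningModule from the sketch above.

```python
from ray import tune
from ray.tune.integration.pytorch_lightning import TuneReportCallback
import pytorch_lightning as pl


def train_fn(config):
    model = MyModel()  # same LightningModule as in the sketch above
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=1,
        auto_scale_batch_size="power",
        callbacks=[TuneReportCallback({"loss": "train_loss"}, on="train_end")],
    )
    trainer.tune(model)  # the batch-size finder runs and reports a batch size
    trainer.fit(model)   # ...but this raises torch.cuda.OutOfMemoryError


tune.run(
    train_fn,
    config={},                       # placeholder search space
    resources_per_trial={"gpu": 1},  # one GPU per trial, no distributed training
)
```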

Does anyone know what causes this? I am posting here because it seems related to Ray and how it allocates resources.

This feature would be useful to me because my cluster has GPUs with different amounts of memory, and I also try different network topologies that work best with different batch sizes.

Hi @Doug,

A few questions:

  • Do you have a minimal reproduction script I could run?
  • What’s your hardware (what kind of GPU, how much VRAM)?
  • How large is your model?
  • What version of Ray are you on?

I ended up refactoring my project and the issue disappeared, so I don’t believe it was a problem with Ray but rather with my own software design.

I’d consider this solved, if that’s OK with you @justinvyu.