How severely does this issue affect your experience of using Ray?
- Medium: It adds significant difficulty to completing my task, but I can work around it.
I have a small codebase that uses PyTorch Lightning. When not using Ray Tune, I can automatically find the best batch size by passing the `auto_scale_batch_size` argument to the Lightning `Trainer` and calling `trainer.tune`. However, when I use Ray Tune with the same code, I always get a `torch.cuda.OutOfMemoryError` in `trainer.fit`, after `trainer.tune` has finished. If I remove the call to `trainer.tune` and select a fixed batch size instead, everything works. A minimal sketch of my setup is included after the list below.
- This problem happens both when I use a single node (the head node) and when I use a cluster with more machines.
- I am not running distributed training; I train a single model on each GPU.
- I am using `ray.tune.integration.pytorch_lightning` and not `ray_lightning`.
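
For context, my code is structured roughly like the sketch below. The model and data module here are trivial stand-ins for my real ones, and the exact `Trainer` arguments, metric names, and resource numbers are simplified, but the flow is the same: a Tune function trainable that reports metrics via the `ray.tune.integration.pytorch_lightning` callback, calls `trainer.tune` with `auto_scale_batch_size`, and then calls `trainer.fit`.

```python
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset
from ray import tune
from ray.tune.integration.pytorch_lightning import TuneReportCallback


class LitModel(pl.LightningModule):
    """Trivial stand-in for my real LightningModule."""

    def __init__(self, lr=1e-3):
        super().__init__()
        self.save_hyperparameters()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", torch.nn.functional.mse_loss(self.layer(x), y))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.lr)


class LitDataModule(pl.LightningDataModule):
    """Stand-in data module; batch_size is an attribute so the batch size
    finder can overwrite it."""

    def __init__(self, batch_size=32):
        super().__init__()
        self.batch_size = batch_size
        self.dataset = TensorDataset(torch.randn(4096, 32), torch.randn(4096, 1))

    def train_dataloader(self):
        return DataLoader(self.dataset, batch_size=self.batch_size)

    def val_dataloader(self):
        return DataLoader(self.dataset, batch_size=self.batch_size)


def train_fn(config):
    model = LitModel(lr=config["lr"])
    datamodule = LitDataModule()

    trainer = pl.Trainer(
        gpus=1,
        max_epochs=5,
        # Ask Lightning to search for the largest batch size that fits in memory.
        auto_scale_batch_size="power",
        callbacks=[TuneReportCallback({"loss": "val_loss"}, on="validation_end")],
    )

    # Runs the batch size finder and writes the result back onto the datamodule.
    trainer.tune(model, datamodule=datamodule)

    # Outside of Ray Tune this works; under Ray Tune this is where I get
    # torch.cuda.OutOfMemoryError.
    trainer.fit(model, datamodule=datamodule)


tune.run(
    train_fn,
    config={"lr": tune.loguniform(1e-4, 1e-2)},
    resources_per_trial={"cpu": 2, "gpu": 1},
    num_samples=4,
)
```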
Does anyone know what causes this? I am posting here because it seems related to Ray and how it allocates resources.
Automatic batch-size scaling would be useful for me because my cluster has GPUs with different amounts of memory, and I also try different network topologies that work best with different batch sizes.