How severe does this issue affect your experience of using Ray?
- Medium: It contributes significant difficulty to completing my task, but I can work around it.
I have a small codebase that uses PyTorch Lightning. When not using Ray Tune, I can automatically find the best batch size by passing the `auto_scale_batch_size` argument to the Lightning `Trainer` and calling `trainer.tune`. However, when using Ray Tune with the same code, I always get a `torch.cuda.OutOfMemoryError` in `trainer.fit`, after `trainer.tune` has run. If I remove the call to `trainer.tune` and select a fixed batch size instead, everything works.
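For context, here is a minimal sketch of the pattern I am describing (assuming PyTorch Lightning 1.x, where `Trainer.tune` and `auto_scale_batch_size` exist; `MyModel` and `train_fn` are placeholder names with dummy data, not my actual code):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from ray import tune
from ray.tune.integration.pytorch_lightning import TuneReportCallback


class MyModel(pl.LightningModule):
    """Placeholder LightningModule; the batch_size attribute is what
    auto_scale_batch_size adjusts."""

    def __init__(self, lr, batch_size):
        super().__init__()
        self.lr = lr
        self.batch_size = batch_size
        self.layer = torch.nn.Linear(32, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", torch.nn.functional.mse_loss(self(x), y))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)

    def train_dataloader(self):
        data = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))
        return DataLoader(data, batch_size=self.batch_size)

    def val_dataloader(self):
        data = TensorDataset(torch.randn(256, 32), torch.randn(256, 1))
        return DataLoader(data, batch_size=self.batch_size)


def train_fn(config):
    model = MyModel(lr=config["lr"], batch_size=config["batch_size"])
    trainer = pl.Trainer(
        gpus=1,
        max_epochs=3,
        auto_scale_batch_size="power",  # Lightning's batch-size finder
        callbacks=[TuneReportCallback({"loss": "val_loss"}, on="validation_end")],
    )
    trainer.tune(model)  # finds the largest batch size that fits
    trainer.fit(model)   # this is where the OOM shows up under Tune


analysis = tune.run(
    train_fn,
    resources_per_trial={"cpu": 2, "gpu": 1},
    config={"lr": tune.loguniform(1e-4, 1e-1), "batch_size": 32},
)
```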
- This problem happens both when I use a single node (the head node) and when I use a cluster with more machines.
- I am not running distributed training; I train a single model on each GPU.
- I am using `ray.tune.integration.pytorch_lightning` and not `ray_lightning`.
Does anyone know what causes this? I am posting here because it seems related to Ray and how it allocates resources.
Automatic batch-size scaling would be useful for me because my cluster has GPUs with different amounts of memory, and I also experiment with different network topologies that work best with different batch sizes.