Tune & PyTorch Lightning: some trials do not terminate, others do

I’m on CentOS 7 with PyTorch Lightning and I’m trying to implement a hyperparameter tuning pipeline with Ray Tune. It seemed simple enough to follow the guide, so my code is essentially an adapted version of it, roughly like the sketch below.
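
For reference, this is a minimal sketch of what the adapted pipeline looks like, not my actual code: `LitModel`, the toy data, and the fixed config values are placeholders, and the `TuneReportCallback` usage follows the guide.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from ray import tune
from ray.tune.integration.pytorch_lightning import TuneReportCallback


class LitModel(pl.LightningModule):
    # Placeholder model, stands in for my real LightningModule
    def __init__(self, lr):
        super().__init__()
        self.lr = lr
        self.layer = nn.Linear(8, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.mse_loss(self(x), y)
        # The metric Tune is supposed to pick up after every validation epoch
        self.log("val_loss", loss)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)


def train_fn(config):
    # Dummy data just so the sketch runs end to end
    data = TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
    train_loader = DataLoader(data, batch_size=16)
    val_loader = DataLoader(data, batch_size=16)

    model = LitModel(lr=config["lr"])
    trainer = pl.Trainer(
        max_epochs=5,
        enable_progress_bar=False,
        # Forwards the logged "val_loss" to Tune at the end of each validation epoch
        callbacks=[TuneReportCallback({"loss": "val_loss"}, on="validation_end")],
    )
    trainer.fit(model, train_loader, val_loader)


analysis = tune.run(
    train_fn,
    config={"lr": 1e-3},  # hyperparameter fixed, so all trials should do the same work
    num_samples=8,
    metric="loss",
    mode="min",
    resources_per_trial={"cpu": 2},
)
```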

My problem: only some trials, seemingly at random, run their full training and validation epochs and terminate. I launch 8 trials with the hyperparameters fixed to the same values, so every trial should do exactly the same work, yet only some of them finish. The others run forever; it looks like no metric ever gets reported back to Tune for those trials. Which trials do report the metric varies between runs, sometimes 4 out of 8, sometimes 3 out of 8.

Running the PL model manually (outside of Tune) works fine every time, and as a sanity check I set the reported metric to a fixed value.

I am new to the community, so please ask if you need any additional information.

Hey @Matt1, welcome to the community! Is it possible for you to share your code?