I’m using CentOS 7 with PyTorch Lightning and am trying to implement a hyperparameter tuning pipeline with Ray Tune. It seemed simple enough to follow the guide, so my code is an adapted version of it.
My problem: only some, seemingly random, trials terminate after completing their full training and validation epochs. I ran 8 trials and set the hyperparameters to the same values, so every trial should do identical work, but only some of them finish. The rest run indefinitely; it looks like no metric is ever reported back to Tune for those trials. Which trials do report back varies from run to run, sometimes 4 out of 8, sometimes 3 out of 8.
Running the PL model manually (outside of Tune) works fine every time. As a control, I set the reported metric to a fixed value.
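To show the reporting path I mean, here is a simplified sketch of my setup, adapted from the guide. `MyLightningModule` is a placeholder for my actual model, and the hyperparameter values are just the fixed ones I used for the test; the Tune pieces (`TuneReportCallback`, `tune.with_parameters`, `tune.run`) are from the Ray Tune / PyTorch Lightning integration.

```python
import pytorch_lightning as pl
from ray import tune
from ray.tune.integration.pytorch_lightning import TuneReportCallback


def train_tune(config, num_epochs=10):
    # MyLightningModule is a stand-in for my actual LightningModule
    model = MyLightningModule(config)
    trainer = pl.Trainer(
        max_epochs=num_epochs,
        enable_progress_bar=False,
        # Report "val_loss" back to Tune at the end of every validation epoch
        callbacks=[TuneReportCallback({"val_loss": "val_loss"}, on="validation_end")],
    )
    trainer.fit(model)


# Fixed values, so all 8 trials should do the same work
config = {"lr": 1e-3, "batch_size": 64}

analysis = tune.run(
    tune.with_parameters(train_tune, num_epochs=10),
    config=config,
    num_samples=8,
)
```

The hanging trials behave as if the callback never fires, even though the terminating trials with the same config report fine.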
I am new to the community, so please ask if you need any additional information.