- High: It blocks me from completing my task.
I have a weird bug with my RTX 3090: it throws a CUDA OOM error after about 3 hours of training. The only fix I have found so far is to stop the Tune run manually and then restart it with resume=True. If I set max_failures=0 (so that it stops automatically), resume does nothing, because all the trials are in the trial.ERROR state.
Basically, what I need to do is reset the trial state to PAUSED and then resume the run from the last checkpoints.
How can I change all the trials to a different state before calling tune.run?
I can do it manually by editing experiment_state-.json in the experiment directory, but is there another way to do it from code?
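
For reference, the manual edit could be scripted roughly as follows. This is only a sketch under assumptions: it assumes the state file is JSON with a top-level `checkpoints` list whose entries (either dicts or JSON-encoded strings) carry a `status` field, which may not match every Ray version. The function name `reset_errored_trials` is made up for illustration and is not part of Ray.

```python
import json
from pathlib import Path


def reset_errored_trials(state_path, new_status="PAUSED"):
    """Rewrite every ERROR trial in an experiment state file.

    Assumed layout (not verified against any particular Ray version):
    {"checkpoints": [<trial>, ...]}, where each <trial> is either a dict
    or a JSON-encoded string containing a "status" field.
    Returns the number of trials whose status was changed.
    """
    path = Path(state_path)
    original_text = path.read_text()
    state = json.loads(original_text)

    changed = 0
    updated = []
    for entry in state.get("checkpoints", []):
        # Entries may be nested JSON strings; decode, edit, re-encode.
        was_string = isinstance(entry, str)
        trial = json.loads(entry) if was_string else entry
        if trial.get("status") == "ERROR":
            trial["status"] = new_status
            changed += 1
        updated.append(json.dumps(trial) if was_string else trial)

    if changed:
        # Keep a backup so the original state can be restored by hand.
        path.with_name(path.name + ".bak").write_text(original_text)
        state["checkpoints"] = updated
        path.write_text(json.dumps(state))
    return changed
```

The idea would be to call this on the state file in the experiment directory and then call tune.run with resume=True as before. Whether Tune actually honors the edited status on resume is part of what I am unsure about.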