How to continue errored out tune.run

  • High: It blocks me from completing my task.

I have a weird bug with my RTX 3090: it throws a CUDA OOM error after about 3 hours of training. The only way I have found to fix it so far is to stop the Tune training manually and then restart with resume=True. If I set max_failures=0 (so that it stops automatically), then resume does nothing, because all the trials are in the Trial.ERROR state.
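For reference, a minimal sketch of the manual restart I currently do; the trainable, experiment name, and paths below are placeholders, not my actual setup:

```python
from ray import tune

# Restarting the same experiment: name/local_dir must match the original run
# so Tune can find the existing experiment_state-*.json and trial checkpoints.
analysis = tune.run(
    my_trainable,            # placeholder: the same trainable as the original run
    name="my_experiment",    # placeholder experiment name
    local_dir="~/ray_results",
    resume=True,             # pick up from the saved experiment state
    max_failures=0,          # stop instead of auto-retrying after the CUDA OOM
)
```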

Basically, what I need to do is reset the trial state to PAUSED and then resume the run from the last checkpoints.

How can I change all the trials to a different state before calling tune.run?

I can do it manually in the experiment directory by editing experiment_state-.json, but is there another way to do this via code?
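For what it's worth, this is roughly what the manual edit looks like as a script. It assumes the experiment state file stores a per-trial "status" field; the exact layout may differ between Ray versions, so treat this as a sketch rather than a supported API:

```python
import glob
import json
import os

experiment_dir = os.path.expanduser("~/ray_results/my_experiment")  # placeholder path

def reset_error_trials(obj):
    # Assumption: trial records in the state file carry a "status" field.
    # Walk the JSON structure and flip ERROR -> PAUSED wherever it appears.
    if isinstance(obj, dict):
        if obj.get("status") == "ERROR":
            obj["status"] = "PAUSED"
        for value in obj.values():
            reset_error_trials(value)
    elif isinstance(obj, list):
        for item in obj:
            reset_error_trials(item)

# Pick the newest experiment_state-*.json in the experiment directory.
state_files = sorted(glob.glob(os.path.join(experiment_dir, "experiment_state-*.json")))
latest = state_files[-1]

with open(latest) as f:
    state = json.load(f)

reset_error_trials(state)

with open(latest, "w") as f:
    json.dump(state, f)
```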

Hi @Andres_Kaver, there is currently no great way to continue training after it has gracefully finished, but we’re looking into this! For the original problem, can you share a bit of your code, e.g. the configuration and the call to tune.run()? Do multiple trials access the same GPU, or is it just one trial whose GPU memory utilization increases over time? If it is a single trial, that sounds like a memory leak, but if more than one trial is involved it could also be a bookkeeping issue on the Ray side.