How to continue errored out tune.run

Andres_Kaver · March 29, 2022, 9:36pm

High: It blocks me from completing my task.

I have a weird bug with my RTX3090: it throws a CUDA OOM error after about 3 hours of training. The only workaround I have found so far is to stop the Tune run manually and then restart it with resume=True. If I set max_failures=0 (to make it stop automatically), then resume does nothing, because all the trials are in the trial.ERROR state.

Basically, what I need to do is reset the trial state to PAUSED and then resume the run from the last checkpoints.

How can I change all the trials to a different state before calling tune.run?

I can do it manually by editing experiment_state-.json in the experiment directory, but is there some other way, via code?
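
For illustration, the manual edit described above could be scripted roughly as follows. This is only a sketch under assumptions, not a supported Ray API: `reset_errored_trials` is a hypothetical helper, and it assumes the state file is a JSON object with a "checkpoints" list whose entries are per-trial records (either dicts or JSON-encoded strings) carrying a "status" field. The layout differs between Ray versions, so check your own file first.

```python
import json
from pathlib import Path


def reset_errored_trials(state_file, from_status="ERROR", to_status="PAUSED"):
    """Rewrite trial statuses in a Tune experiment-state file.

    ASSUMPTION: the file is a JSON object with a "checkpoints" list, and each
    entry is a per-trial record (a dict, or a JSON-encoded string) that has a
    "status" field. Verify this against your own file before relying on it.
    Returns the number of trial records that were changed.
    """
    path = Path(state_file)
    original_text = path.read_text()
    state = json.loads(original_text)

    changed = 0
    updated = []
    for record in state.get("checkpoints", []):
        encoded = isinstance(record, str)
        trial = json.loads(record) if encoded else record
        if isinstance(trial, dict) and trial.get("status") == from_status:
            trial["status"] = to_status
            changed += 1
        # Write each record back in the same form it was read in.
        updated.append(json.dumps(trial) if encoded else trial)

    if changed:
        # Keep a backup so the original state can be restored if needed.
        path.with_name(path.name + ".bak").write_text(original_text)
        state["checkpoints"] = updated
        path.write_text(json.dumps(state, indent=2))
    return changed
```

It would be called on the state file before the next tune.run(..., resume=True). Depending on the Ray version, the tune.run documentation also lists resume="ERRORED_ONLY", which is described as resetting and rerunning errored trials, so that is worth checking before editing the file by hand.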

kai · April 4, 2022, 6:13pm

Hi @Andres_Kaver, there is currently no great way to continue training after a run has finished gracefully, but we’re looking into this! For the original problem, can you share a bit of your code, e.g. the configuration and the call to tune.run()? Do multiple trials access the same GPU, or is it just one trial whose GPU memory utilization increases over time? If it is just one trial, that sounds like a memory leak, but if it’s more than one trial it could also be a bookkeeping issue on the Ray side.
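
One way to tell those two cases apart is to sample per-process GPU memory while the run is going. The sketch below is just one possible approach, not something from the thread: it assumes nvidia-smi is on the PATH, and the helper names `parse_compute_apps` and `gpu_memory_by_process` are made up for this example.

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, used_memory' CSV rows into a {pid: MiB} dict."""
    usage = {}
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2 or not all(p.isdigit() for p in parts):
            continue  # skip blank rows and values such as "[N/A]"
        usage[int(parts[0])] = int(parts[1])
    return usage


def gpu_memory_by_process():
    """Return per-process GPU memory (MiB) as reported by nvidia-smi."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-compute-apps=pid,used_memory",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_compute_apps(result.stdout)
```

Logging the result of gpu_memory_by_process() every minute or so should show the pattern: a single PID whose usage climbs steadily points toward a leak in the trainable, while several trial PIDs on the same GPU point toward how trials are being scheduled onto it.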
