Tuner cannot restore the checkpoints!

How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi all,

I am using Tuner and Tuner.fit() to train a PPO agent on my custom env.

I successfully trained the agent for some time without any errors. Now I would like to restore the checkpoints for further training; however, it gives me the following error:

*** RuntimeError: Could not find Tuner state in restore directory. Did you pass the correct path (including experiment directory?)

This is what my checkpoint directory looks like:

This is my code:

tuner = tune.Tuner(
    "PPO",
    run_config=run_config,
    param_space=param_space,
)

chkpt_path = "/home/PPO/PPO_MasterEnv_214c3_00000_0_2023-06-01_15-13-42/checkpoint_001300/"

tuner.restore(chkpt_path)

results = tuner.fit()

What is stranger is that I can restore the checkpoint with the Algorithm (trainer) directly and use the model for prediction or further training, like this:

algo = param_space.build()
chkpt_path = "/home/PPO/PPO_MasterEnv_214c3_00000_0_2023-06-01_15-13-42/checkpoint_001300/"
algo.restore(chkpt_path)
algo.train()

But I would like to use the Tuner, and this gives me that error!

Neither the RLlib documentation nor this example shows how to restore checkpoints with the Tuner.

Can anyone help me to fix this?

Thanks!

Hi @deepgravity, to restore a Tuner, you have to pass the experiment directory path (/home/PPO) instead of the checkpoint path.
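A minimal sketch using the paths from the question above (depending on your Ray version, Tuner.restore() may also require the trainable, e.g. "PPO", as a second argument):

from ray import tune

# Pass the experiment directory, not an individual checkpoint directory
tuner = tune.Tuner.restore("/home/PPO", trainable="PPO")
results = tuner.fit()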


Hi @yunxuanx, thank you for your reply. Yes, I had finally figured it out! It is really strange, though. If I remember correctly, in Ray 1.x, we passed the checkpoint dir path!

Hi @deepgravity, I restored the tuner with “/path_to/PPO”, but the tuner starts training from scratch rather than from the last checkpoint. How did you solve this?

Hey @Shengchao_Y, how are you terminating the original run? Do you force kill the experiment?

Hey @justinvyu, the original run was terminated due to an internal error caused by a rarely happening environment error. I would like to continue the trial after fixing this error.

Were any checkpoints saved before the run terminated? How long was it running for before the environment error?

Yes, I saved checkpoints every 20 iterations. It ran for about one and a half days.
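If the run errored out and you want the restored Tuner to pick those trials up from their latest checkpoints instead of starting over, here is a minimal sketch, assuming the resume_errored flag is available in your Ray version (flag names may differ between releases):

from ray import tune

# resume_errored continues errored trials from their latest checkpoint;
# restart_errored would instead restart them from scratch.
tuner = tune.Tuner.restore(
    "/path_to/PPO",      # experiment directory, as above
    trainable="PPO",
    resume_errored=True,
)
results = tuner.fit()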

I think you need to share some code; otherwise, it would be hard to figure out a solution.

Oh man, your code is so helpful for me. It took me 2 hours to fix it. Thanks!


Happy you found it helpful