Tuner cannot restore the checkpoints!

How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi all,

I am using Tuner and Tuner.fit() to train a PPO agent on my custom env.

I successfully trained the agent for some time without any errors. Now I would like to restore the checkpoints for further training; however, it gives me the following error:

*** RuntimeError: Could not find Tuner state in restore directory. Did you pass the correct path (including experiment directory?)

This is what my checkpoint directory looks like:

This is my code:

tuner = tune.Tuner(
    "PPO",
    run_config=run_config,
    param_space=param_space,
)

chkpt_path = "/home/PPO/PPO_MasterEnv_214c3_00000_0_2023-06-01_15-13-42/checkpoint_001300/"

tuner.restore(chkpt_path)

results = tuner.fit()

What is stranger is that I can restore the checkpoint with the Algorithm (trainer) directly and use the model for prediction or further training, like this:

algo = param_space.build()
chkpt_path = "/home/PPO/PPO_MasterEnv_214c3_00000_0_2023-06-01_15-13-42/checkpoint_001300/"
algo.restore(chkpt_path)
algo.train()

But I would like to use the Tuner, and this gives me that error!

Neither the RLlib documentation nor this example shows how to restore checkpoints with the Tuner.

Can anyone help me to fix this?

Thanks!

Hi @deepgravity, to restore a Tuner, you have to pass the experiment directory path (/home/PPO) instead of the checkpoint path.
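A minimal sketch using the paths from the question above (depending on your Ray version, Tuner.restore() may also require the trainable, e.g. "PPO", as a second argument):

from ray import tune

# Pass the experiment directory, not an individual checkpoint directory
tuner = tune.Tuner.restore("/home/PPO", trainable="PPO")
results = tuner.fit()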


Hi @yunxuanx, thank you for your reply. Yes, I had finally figured it out! It is really strange, though. If I remember correctly, in Ray 1.x, we passed the checkpoint dir path!

Hi @deepgravity, I restored the tuner with “/path_to/PPO”, but the tuner starts training from scratch rather than from the last checkpoint. How did you solve this?

Hey @Shengchao_Y, how are you terminating the original run? Do you force kill the experiment?

Hey @justinvyu, the original run was terminated due to an internal error caused by a rarely happening environment error. I would like to continue the trial after fixing this error.

Were any checkpoints saved before the run terminated? How long was it running for before the environment error?

Yes, I saved checkpoints every 20 iterations. It ran for about one and a half days.
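If the run errored out and you want the restored Tuner to pick those trials up from their latest checkpoints instead of starting over, here is a minimal sketch, assuming the resume_errored flag is available in your Ray version (flag names may differ between releases):

from ray import tune

# resume_errored continues errored trials from their latest checkpoint;
# restart_errored would instead restart them from scratch.
tuner = tune.Tuner.restore(
    "/path_to/PPO",      # experiment directory, as above
    trainable="PPO",
    resume_errored=True,
)
results = tuner.fit()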

I think you need to share some code; otherwise, it would be hard to figure out a solution.

Oh man, your code is so helpful for me. It took me 2 hours to fix it. Thanks!


Happy you found it helpful