Unable to restore fully trained checkpoint

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I’ve finished training several algorithms using the Tuner() API and the AIR library, and they all have their checkpoint folders and files. However, I can’t restore those checkpoints. I tried both Tuner.restore() and run(restore=), and neither worked.
When using Tuner.restore() I got this error:

(ApexDQN pid=476180) 2022-11-14 14:39:07,333 INFO trainable.py:715 -- Checkpoint path was not available, trying to recover from latest available checkpoint instead. Unavailable checkpoint path: G:\Repos\ML_CIV6\models(3w2s)-d(2w1s)_default\APEX\APEX_my_env_5ed3c_00000_0_2022-09-27_20-07-42\checkpoint_004000\checkpoint-4000

And for run(restore=) I got this error:

RuntimeError: Could not find Tuner state in restore directory. Did you passthe correct path (including experiment directory?) Got: G:\Repos\ML_CIV6\models(3w2s)-d(2w1s)_default\APEX\APEX_my_env_5ed3c_00000_0_2022-09-27_20-07-42

The training code:

I’ve also tried pointing to the folders above the checkpoint file, but every attempt produced the same error output.

Thank you in advance.

Hi, I think you need to restore from G:\Repos\ML_CIV6\models(3w2s)-d(2w1s)_default\APEX\APEX_my_env_5ed3c_00000_0_2022-09-27_20-07-42\ if you are using Tuner().
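One way to check which folder Tuner.restore() is likely to accept is to look for the Tuner state file itself. The sketch below walks upward from any path (a checkpoint file, a trial folder) until it finds a directory containing `tuner.pkl`. That file name is an assumption based on how Ray 2.x stores Tuner state and may differ in other releases; `find_experiment_dir` and `restore_tuner` are hypothetical helper names, not part of Ray.

```python
from pathlib import Path
from typing import Optional

# Assumption: Ray 2.x writes the Tuner state to a file named "tuner.pkl"
# in the experiment directory. Adjust this if your version differs.
TUNER_STATE_FILE = "tuner.pkl"


def find_experiment_dir(start: str) -> Optional[Path]:
    """Walk upward from `start` and return the first directory that
    contains the Tuner state file, or None if no ancestor has it."""
    path = Path(start).resolve()
    if path.is_file():
        path = path.parent
    for candidate in (path, *path.parents):
        if (candidate / TUNER_STATE_FILE).is_file():
            return candidate
    return None


def restore_tuner(start: str):
    """Hypothetical wrapper: locate the experiment directory, then restore."""
    # Imported lazily so find_experiment_dir works without Ray installed.
    from ray.tune import Tuner

    experiment_dir = find_experiment_dir(start)
    if experiment_dir is None:
        raise FileNotFoundError(
            f"No {TUNER_STATE_FILE} found at or above {start}"
        )
    # Newer Ray releases may also expect a `trainable=` argument here.
    return Tuner.restore(str(experiment_dir))
```

If `find_experiment_dir` returns None for every path you try, the Tuner state was probably never written to that results folder, which would be consistent with the "Could not find Tuner state" error.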

I have tried that, along with a bunch of other folder paths, and none of them worked.

Can you upgrade to 2.1 and check?

I’ve tried 2.1, and the same issue persists.