Unable to restore fully trained checkpoint

How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I’ve finished training with a bunch of algorithms using the Tuner() API and the air library, and they all have their checkpoint folders and files. However, I can’t seem to restore those checkpoints. I tried Tuner.restore() and run(restore=); neither worked.
When using Tuner.restore() I got this error:

(ApexDQN pid=476180) 2022-11-14 14:39:07,333 INFO trainable.py:715 -- Checkpoint path was not available, trying to recover from latest available checkpoint instead. Unavailable checkpoint path: G:\Repos\ML_CIV6\models(3w2s)-d(2w1s)_default\APEX\APEX_my_env_5ed3c_00000_0_2022-09-27_20-07-42\checkpoint_004000\checkpoint-4000

And for run(restore=) I got this error:

RuntimeError: Could not find Tuner state in restore directory. Did you pass the correct path (including experiment directory?) Got: G:\Repos\ML_CIV6\models(3w2s)-d(2w1s)_default\APEX\APEX_my_env_5ed3c_00000_0_2022-09-27_20-07-42

The training code:

I’ve also tried pointing at the folders above the checkpoint file; they all resulted in the same error output.

Thank you in advance.

Hi, I think you need to restore from G:\Repos\ML_CIV6\models(3w2s)-d(2w1s)_default\APEX\APEX_my_env_5ed3c_00000_0_2022-09-27_20-07-42\ if you are using Tuner().
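For reference, a minimal sketch of such a restore call (my assumption of the intended usage on Ray 2.x, not verified against this setup; note that Tuner.restore() looks for the Tuner state file, tuner.pkl, which normally lives in the experiment directory one level above the trial folder):

from ray import tune

# Point Tuner.restore() at the directory that contains tuner.pkl
# (the experiment directory, here the APEX folder).
tuner = tune.Tuner.restore(r"G:\Repos\ML_CIV6\models(3w2s)-d(2w1s)_default\APEX")
results = tuner.get_results()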

I have tried that, and a bunch of other folder paths, and none of them worked.

Can you upgrade to 2.1 and check?

I’ve tried 2.1; the same issue persists.

I found loading from a checkpoint a little tricky too. This works for me:

from ray.rllib.algorithms.registry import get_algorithm_class

# Build the algorithm with the same config that was used for training.
algo_cls = get_algorithm_class("DQN")
algo = algo_cls(config=config)
checkpoint_path = "/path_to_folder/checkpoint_000225/rllib_checkpoint.json"
algo.restore(checkpoint_path)

Hi, please consider upgrading to 2.2 and using the Algorithm.from_checkpoint API.
Other than that: can you post a reproduction script, please?
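A minimal sketch of that API, assuming Ray >= 2.2 (the checkpoint path is illustrative):

from ray.rllib.algorithms.algorithm import Algorithm

# from_checkpoint() takes a checkpoint directory written during training
# and rebuilds the algorithm, including its config.
algo = Algorithm.from_checkpoint("/path/to/trial_dir/checkpoint_004000")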

Same error for me after a successful training run.

import os

from ray import air, tune
from ray.rllib.algorithms.ppo import PPOConfig

# env_name is assumed to be a registered environment.
config = (
    PPOConfig()
    .rollouts(
        num_rollout_workers=4,
        # rollout_fragment_length=512
    )
    .training(
        train_batch_size=512,
        lr=2e-5,
        gamma=0.99,
        lambda_=0.9,
        use_gae=True,
        clip_param=0.4,
        grad_clip=None,
        entropy_coeff=0.1,
        vf_loss_coeff=0.25,
        sgd_minibatch_size=64,
        num_sgd_iter=10,
    )
    .environment(env=env_name, clip_actions=True)
    .debugging(log_level="ERROR")
    .framework(framework="torch")
    .resources(num_gpus=int(os.environ.get("RLLIB_NUM_GPUS", "0")))
)

tune.Tuner(
    "PPO",
    run_config=air.RunConfig(stop={"timesteps_total": 5000000}),
    param_space=config.to_dict(),
).fit()

Version 2.2.
Also, here is the list of files in the restore directory:

error.txt  
events.out.tfevents.1671970856.pcname  
params.json  
params.pkl  
result.json
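There are no checkpoint_* folders in that list. One guess, offered as an assumption rather than a confirmed diagnosis: the RunConfig above never configures checkpointing, so nothing beyond the experiment state gets written. A sketch of enabling it, using Ray 2.2's air.CheckpointConfig:

from ray import air, tune

tune.Tuner(
    "PPO",
    run_config=air.RunConfig(
        stop={"timesteps_total": 5000000},
        checkpoint_config=air.CheckpointConfig(
            checkpoint_frequency=10,  # write a checkpoint every 10 iterations
            checkpoint_at_end=True,
        ),
    ),
    param_space=config.to_dict(),  # config as defined above
).fit()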

What is the correct way to load the results of Tuner.fit() from a directory so that they can be analyzed?

@james116blue,

Please post a complete reproduction script!

Algorithm.load_checkpoint() is deprecated.
The “correct” way to load checkpoints is with the Algorithm.from_checkpoint() API.
You can find multiple examples in the examples folder and our documentation.

You can, for example, retrieve the best checkpoint (and later load it) with:

best_checkpoint = results.get_best_result(
    metric="episode_reward_mean",
    mode="max",
).checkpoint
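If the ResultGrid from fit() is no longer in memory, it should also be recoverable from the experiment directory, roughly like this (a sketch, assuming Ray 2.2-era APIs; the path is illustrative):

from ray import tune
from ray.rllib.algorithms.algorithm import Algorithm

# Re-attach to a finished experiment on disk.
tuner = tune.Tuner.restore("~/ray_results/PPO")
results = tuner.get_results()

best_checkpoint = results.get_best_result(
    metric="episode_reward_mean", mode="max"
).checkpoint
algo = Algorithm.from_checkpoint(best_checkpoint)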

Cheers


Thank you for your detailed answer.
I found a solution thanks to your reply and the updated documentation.


Algorithm.from_checkpoint() cannot load the checkpoints produced by tuning in Ray 2.2. Could you please show how to use it, if possible in Colab?

Hi @AI360,

This is an example from our docs: ray/saving_and_loading_algos_and_policies.py at master · ray-project/ray · GitHub
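Condensed, the pattern from that example looks roughly like this (a sketch; CartPole stands in for the real environment):

from ray.rllib.algorithms.algorithm import Algorithm
from ray.rllib.algorithms.ppo import PPOConfig

# Train briefly, save a checkpoint, then rebuild the algorithm from it.
algo = PPOConfig().environment("CartPole-v1").framework("torch").build()
algo.train()
checkpoint_dir = algo.save()

restored = Algorithm.from_checkpoint(checkpoint_dir)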

Yes, I checked it, but I got this error:

File "/usr/local/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 278, in from_checkpoint
return Algorithm.from_state(state)
File "/usr/local/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 306, in from_state
new_algo = algorithm_class(config=config)
File "/usr/local/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 368, in __init__
config.validate()
File "/usr/local/lib/python3.10/site-packages/ray/rllib/algorithms/ppo/ppo.py", line 222, in validate
super().validate()
File "/usr/local/lib/python3.10/site-packages/ray/rllib/algorithms/pg/pg.py", line 91, in validate
super().validate()
File "/usr/local/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm_config.py", line 556, in validate
self._resolve_tf_settings(_tf1, _tfv)
File "/usr/local/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm_config.py", line 2490, in _resolve_tf_settings
_tf1.enable_eager_execution()
File "/usr/local/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 6155, in enable_eager_execution
return enable_eager_execution_internal(
File "/usr/local/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 6223, in enable_eager_execution_internal
raise ValueError(
ValueError: tf.enable_eager_execution must be called at program startup.

@arturn This error has to do with _resolve_tf_settings() from the AlgorithmConfig. It checks whether eager execution is enabled for tf1; if not, it calls tf1's enable_eager_execution(), which raises this error.

Can we change this somehow?

Are you getting this error on master? I can execute it without errors, it seems.

@arturn No, I got this error on Ray 2.2.0, as did AI360. This might have been fixed on master and may already be in Ray 2.3.0.


@AI360 do you call from_checkpoint() in a Jupyter notebook? Also, could you just try putting this at the start of your code:

# Import TF via RLlib's helper before anything else touches TensorFlow.
from ray.rllib.utils.framework import try_import_tf
tf1, tf, tfv = try_import_tf()

This should at least make your code work, I guess.

Good morning. Just to confirm, I am still getting this error on 2.4.0.

Code (in a .py script):

from ray.rllib.algorithms.algorithm import Algorithm

checkpoint = algo.save()  # algo is a trained Algorithm instance
restored_algo = Algorithm.from_checkpoint(checkpoint)

Error:
2023-06-05 09:46:28,769 WARNING checkpoints.py:109 -- No rllib_checkpoint.json file found in checkpoint directory /home/marc/ray_results/DQN_GymEnvironment_2023-06-05_09-39-47ykvo2ont/checkpoint_000010! Trying to extract checkpoint info from other files found in that dir.

Just checking: is there a fix due imminently, please?
Many thanks.


Hi, I used the following and it works; maybe it helps:

from ray.rllib.utils.framework import try_import_tf
tf1, tf, _ = try_import_tf()
# Eager execution must be enabled at program startup, before other TF calls.
tf1.enable_eager_execution()

Best regards.

Hi @arturn,
I used Algorithm.from_checkpoint, but I still encountered the same issue. Would you please take a look at this post?