Unable to restore fully trained checkpoint

How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I’ve finished training with a bunch of algorithms using the Tuner() API and the air library, and they all have their checkpoint folders and files. However, I can’t seem to restore those checkpoints. I tried Tuner.restore() and run(restore=); neither worked.
When using Tuner.restore() I got this error:

(ApexDQN pid=476180) 2022-11-14 14:39:07,333 INFO trainable.py:715 -- Checkpoint path was not available, trying to recover from latest available checkpoint instead. Unavailable checkpoint path: G:\Repos\ML_CIV6\models(3w2s)-d(2w1s)_default\APEX\APEX_my_env_5ed3c_00000_0_2022-09-27_20-07-42\checkpoint_004000\checkpoint-4000

And for run(restore=) I got this error:

RuntimeError: Could not find Tuner state in restore directory. Did you pass the correct path (including experiment directory?) Got: G:\Repos\ML_CIV6\models(3w2s)-d(2w1s)_default\APEX\APEX_my_env_5ed3c_00000_0_2022-09-27_20-07-42

The training code:

I’ve also tried pointing at the folders above the checkpoint file; they all resulted in the same error output.

Thank you in advance.

Hi, I think you need to restore from G:\Repos\ML_CIV6\models(3w2s)-d(2w1s)_default\APEX\APEX_my_env_5ed3c_00000_0_2022-09-27_20-07-42\ if you are using Tuner().
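For reference, a minimal sketch of such a restore call (my assumption of the intended usage on Ray 2.x, not verified against this setup; note that Tuner.restore() looks for the Tuner state file, tuner.pkl, which normally lives in the experiment directory one level above the trial folder):

from ray import tune

# Point Tuner.restore() at the directory that contains tuner.pkl
# (the experiment directory, here the APEX folder).
tuner = tune.Tuner.restore(r"G:\Repos\ML_CIV6\models(3w2s)-d(2w1s)_default\APEX")
results = tuner.get_results()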

I have tried that, and a bunch of other folder paths, and none of them worked.

Can you upgrade to 2.1 and check?

I’ve tried 2.1; the same issue persists.

I found loading from a checkpoint a little tricky too. This works for me:

from ray.rllib.algorithms.registry import get_algorithm_class

# Build the algorithm with the same config that was used for training.
algo_cls = get_algorithm_class("DQN")
algo = algo_cls(config=config)
checkpoint_path = "/path_to_folder/checkpoint_000225/rllib_checkpoint.json"
algo.restore(checkpoint_path)

Hi, please consider upgrading to 2.2 and using the Algorithm.from_checkpoint API.
Other than that: can you post a reproduction script, please?
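A minimal sketch of that API, assuming Ray >= 2.2 (the checkpoint path is illustrative):

from ray.rllib.algorithms.algorithm import Algorithm

# from_checkpoint() takes a checkpoint directory written during training
# and rebuilds the algorithm, including its config.
algo = Algorithm.from_checkpoint("/path/to/trial_dir/checkpoint_004000")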

Same error for me after a successful training run.

import os

from ray import air, tune
from ray.rllib.algorithms.ppo import PPOConfig

# env_name is assumed to be a registered environment.
config = (
    PPOConfig()
    .rollouts(
        num_rollout_workers=4,
        # rollout_fragment_length=512
    )
    .training(
        train_batch_size=512,
        lr=2e-5,
        gamma=0.99,
        lambda_=0.9,
        use_gae=True,
        clip_param=0.4,
        grad_clip=None,
        entropy_coeff=0.1,
        vf_loss_coeff=0.25,
        sgd_minibatch_size=64,
        num_sgd_iter=10,
    )
    .environment(env=env_name, clip_actions=True)
    .debugging(log_level="ERROR")
    .framework(framework="torch")
    .resources(num_gpus=int(os.environ.get("RLLIB_NUM_GPUS", "0")))
)

tune.Tuner(
    "PPO",
    run_config=air.RunConfig(stop={"timesteps_total": 5000000}),
    param_space=config.to_dict(),
).fit()

Version 2.2.
Also, here is the list of files in the restore directory:

error.txt  
events.out.tfevents.1671970856.pcname  
params.json  
params.pkl  
result.json
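There are no checkpoint_* folders in that list. One guess, offered as an assumption rather than a confirmed diagnosis: the RunConfig above never configures checkpointing, so nothing beyond the experiment state gets written. A sketch of enabling it, using Ray 2.2's air.CheckpointConfig:

from ray import air, tune

tune.Tuner(
    "PPO",
    run_config=air.RunConfig(
        stop={"timesteps_total": 5000000},
        checkpoint_config=air.CheckpointConfig(
            checkpoint_frequency=10,  # write a checkpoint every 10 iterations
            checkpoint_at_end=True,
        ),
    ),
    param_space=config.to_dict(),  # config as defined above
).fit()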

What is the correct way to load the results of Tuner.fit() from a directory so that they can be analyzed?

@james116blue,

Please post a complete reproduction script!

Algorithm.load_checkpoint() is deprecated.
The “correct” way to load checkpoints is with the Algorithm.from_checkpoint() API.
You can find multiple examples in the examples folder and our documentation.

You can, for example, retrieve the best checkpoint (and later load it) with:

best_checkpoint = results.get_best_result(
    metric="episode_reward_mean",
    mode="max",
).checkpoint
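If the ResultGrid from fit() is no longer in memory, it should also be recoverable from the experiment directory, roughly like this (a sketch, assuming Ray 2.2-era APIs; the path is illustrative):

from ray import tune
from ray.rllib.algorithms.algorithm import Algorithm

# Re-attach to a finished experiment on disk.
tuner = tune.Tuner.restore("~/ray_results/PPO")
results = tuner.get_results()

best_checkpoint = results.get_best_result(
    metric="episode_reward_mean", mode="max"
).checkpoint
algo = Algorithm.from_checkpoint(best_checkpoint)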

Cheers


Thank you for your detailed answer.
I found a solution thanks to your reply and the updated documentation.


Algorithm.from_checkpoint() cannot load the checkpoints produced by tuning in Ray 2.2. Could you please show how to use it, if possible in Colab?

Hi @AI360,

This is an example from our docs: ray/saving_and_loading_algos_and_policies.py at master · ray-project/ray · GitHub
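Condensed, the pattern from that example looks roughly like this (a sketch; CartPole stands in for the real environment):

from ray.rllib.algorithms.algorithm import Algorithm
from ray.rllib.algorithms.ppo import PPOConfig

# Train briefly, save a checkpoint, then rebuild the algorithm from it.
algo = PPOConfig().environment("CartPole-v1").framework("torch").build()
algo.train()
checkpoint_dir = algo.save()

restored = Algorithm.from_checkpoint(checkpoint_dir)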

Yes, I checked it, but I got this error:

File "/usr/local/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 278, in from_checkpoint
return Algorithm.from_state(state)
File "/usr/local/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 306, in from_state
new_algo = algorithm_class(config=config)
File "/usr/local/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 368, in __init__
config.validate()
File "/usr/local/lib/python3.10/site-packages/ray/rllib/algorithms/ppo/ppo.py", line 222, in validate
super().validate()
File "/usr/local/lib/python3.10/site-packages/ray/rllib/algorithms/pg/pg.py", line 91, in validate
super().validate()
File "/usr/local/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm_config.py", line 556, in validate
self._resolve_tf_settings(_tf1, _tfv)
File "/usr/local/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm_config.py", line 2490, in _resolve_tf_settings
_tf1.enable_eager_execution()
File "/usr/local/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 6155, in enable_eager_execution
return enable_eager_execution_internal(
File "/usr/local/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 6223, in enable_eager_execution_internal
raise ValueError(
ValueError: tf.enable_eager_execution must be called at program startup.

@arturn This error has to do with _resolve_tf_settings() from the AlgorithmConfig. It checks whether eager execution is enabled for tf1; if not, it calls tf1's enable_eager_execution(), which raises this error.

Can we change this somehow?

Are you getting this error on master? I can execute it without errors, it seems.

@arturn No, I got this error on Ray 2.2.0, as did AI360. This might have been fixed on master and may already be in Ray 2.3.0.


@AI360 do you call from_checkpoint() in a Jupyter notebook? Also, could you just try putting this at the start of your code:

# Import TF via RLlib's helper before anything else touches TensorFlow.
from ray.rllib.utils.framework import try_import_tf
tf1, tf, tfv = try_import_tf()

This should at least make your code work, I guess.

Good morning. Just to confirm, I am still getting this error on 2.4.0.

Code (in a .py script):

from ray.rllib.algorithms.algorithm import Algorithm

checkpoint = algo.save()  # algo is a trained Algorithm instance
restored_algo = Algorithm.from_checkpoint(checkpoint)

Error:
2023-06-05 09:46:28,769 WARNING checkpoints.py:109 -- No rllib_checkpoint.json file found in checkpoint directory /home/marc/ray_results/DQN_GymEnvironment_2023-06-05_09-39-47ykvo2ont/checkpoint_000010! Trying to extract checkpoint info from other files found in that dir.

Just checking: is there a fix due imminently, please?
Many thanks.


Hi, I used the following and it works; maybe it helps:

from ray.rllib.utils.framework import try_import_tf
tf1, tf, _ = try_import_tf()
# Eager execution must be enabled at program startup, before other TF calls.
tf1.enable_eager_execution()

Best regards.

Hi @arturn,
I used Algorithm.from_checkpoint, but I still encountered the same issue. Would you please take a look at this post?