Can't properly restore a result trained with RLlib using ray.train.Result

Hi! I want to check the corresponding metrics for each of the checkpoints saved when training with RLlib and Ray Tune. I've referred to the ray.train.Result API and tried to restore a result from its path. However, the restored metrics_df has no "checkpoint_dir_name" column, so loading the result fails.

This can be easily reproduced with the following code:

from ray.rllib.algorithms.ppo import PPOConfig
from ray import tune
from ray.train import RunConfig, CheckpointConfig

tuner = tune.Tuner(
    "PPO",
    param_space=PPOConfig().environment("CartPole-v1").to_dict(),
    run_config=RunConfig(
        # Checkpoint every iteration and once more at the end of training.
        checkpoint_config=CheckpointConfig(checkpoint_at_end=True, checkpoint_frequency=1),
        stop={"training_iteration": 3},
    ),
)

# get_best_result() needs a metric/mode here since none were set in a
# TuneConfig; "episode_reward_mean" is the old-API-stack metric name.
best_result = tuner.fit().get_best_result(metric="episode_reward_mean", mode="max")

# Loading
from ray.train import Result

restored_result = Result.from_path(best_result.path)

The script above raises a KeyError, complaining that there is no "checkpoint_dir_name" column in the metrics_df.
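
As far as I can tell, Result.from_path rebuilds metrics_df from the progress.csv that Tune writes into the trial directory, so the missing column can be confirmed directly (a minimal check, assuming the default CSV logger is on):

import pandas as pd
from pathlib import Path

# "checkpoint_dir_name" never shows up among the logged columns, which
# matches the KeyError above.
df = pd.read_csv(Path(best_result.path) / "progress.csv")
print("checkpoint_dir_name" in df.columns)  # -> False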

It seems the problem is that RLlib does not autofill the "checkpoint_dir_name" metric that both Train and Tune require. Adding a custom callback works around this:

from ray.rllib.algorithms.algorithm import Algorithm
from ray.rllib.algorithms.callbacks import DefaultCallbacks

class CheckpointCallback(DefaultCallbacks):
    def on_train_result(self, *, algorithm: Algorithm, result: dict, **kwargs) -> None:
        # Fill in the metric that Train/Tune expect but RLlib doesn't report.
        if algorithm._storage:
            result["checkpoint_dir_name"] = algorithm._storage.checkpoint_dir_name

config = PPOConfig().callbacks(CheckpointCallback)

But even with this callback, I still can't get the metric row for the final checkpoint, presumably because the checkpoint created by checkpoint_at_end=True is written after the last on_train_result call, so no result row ever carries its directory name.
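
For now I'm pairing checkpoint directories with metric rows by hand. This is only a sketch under my own assumptions: with checkpoint_frequency=1, checkpoint index i was taken after training iteration i + 1, any extra checkpoint_at_end copy maps onto the last iteration, and "episode_reward_mean" is the old-API-stack metric name.

import pandas as pd
from pathlib import Path

trial_dir = Path(best_result.path)
df = pd.read_csv(trial_dir / "progress.csv")

for ckpt_dir in sorted(trial_dir.glob("checkpoint_*")):
    # checkpoint_000000 -> iteration 1, checkpoint_000001 -> iteration 2, ...
    # Clamp so a trailing checkpoint_at_end duplicate maps to the final row.
    index = int(ckpt_dir.name.split("_")[-1])
    iteration = min(index + 1, int(df["training_iteration"].max()))
    row = df[df["training_iteration"] == iteration].iloc[0]
    print(ckpt_dir.name, "->", row["episode_reward_mean"])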