How to save the best checkpoint of the training using RLlib

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I am training a model whose convergence is very unstable. I would like to know how to save the best checkpoint of the training run using RLlib. I am running the training with:

import copy
import os

import ray

# args and num_gpus are defined elsewhere (e.g. via argparse)
experiment_params = {
    "training": {
        "env": "wf",
        "run": args.algorithm,
        "stop": {
            "training_iteration": 2000,
            # "timesteps_total": 8000000,
        },
        "local_dir": "/opt/ml/output/intermediate",
        "checkpoint_at_end": True,
        "checkpoint_freq": 10,
        "config": {
            "num_workers": int(os.cpu_count()) - 1,
            "lr": 0.0001,
            "num_gpus": num_gpus,
            "gamma": float(args.gamma),
            "seed": args.seed,
        },
    }
}
ray.tune.run_experiments(copy.deepcopy(experiment_params))

Hi @carlorop, with that configuration checkpoints are already being written every 10 training iterations (and once at the end of training) to a subdirectory of /opt/ml/output/intermediate.

You can then get the best checkpoint using, e.g.:

# Point ExperimentAnalysis at the experiment directory created under local_dir
analysis = ray.tune.ExperimentAnalysis("/opt/ml/output/intermediate/training")
best_checkpoint = analysis.get_best_checkpoint(
    analysis.trials[0], metric="episode_reward_mean", mode="max"
)

(note the directory name might be slightly different)
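If you then want to load that best checkpoint back into a trainer (e.g. for evaluation or to continue training), something like the sketch below should work. Note this assumes you trained with PPO, so swap in the trainer class matching your args.algorithm, and that your custom "wf" env is registered in this process the same way it was for training:

from ray.rllib.agents.ppo import PPOTrainer  # swap for the trainer class of args.algorithm

# Rebuild a trainer with the same env/config used for training,
# then load the weights from the best checkpoint found above.
agent = PPOTrainer(env="wf", config={"num_workers": 0})
agent.restore(best_checkpoint)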

See also Analysis (tune.analysis) — Ray 1.11.0
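One more option: if you want Tune itself to keep only the highest-scoring checkpoints on disk during training, the experiment spec should also accept keep_checkpoints_num and checkpoint_score_attr (worth double-checking against your Ray version), e.g.:

experiment_params = {
    "training": {
        # ... same settings as in your spec above ...
        "checkpoint_freq": 10,
        "checkpoint_at_end": True,
        # Keep only the 5 checkpoints with the best episode_reward_mean
        "keep_checkpoints_num": 5,
        "checkpoint_score_attr": "episode_reward_mean",
    }
}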
