How to save the best checkpoint of the training using RLlib

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I am training a model whose convergence is very unstable. I would like to know how to save the best checkpoint of the training run using RLlib. I am running the training with:

import copy
import os

import ray

# args and num_gpus are defined elsewhere (e.g. via argparse)
experiment_params = {
    "training": {
        "env": "wf",
        "run": args.algorithm,
        "stop": {
            "training_iteration": 2000,
            # "timesteps_total": 8000000,
        },
        "local_dir": "/opt/ml/output/intermediate",
        "checkpoint_at_end": True,
        "checkpoint_freq": 10,
        "config": {
            "num_workers": int(os.cpu_count()) - 1,
            "lr": 0.0001,
            "num_gpus": num_gpus,
            "gamma": float(args.gamma),
            "seed": args.seed,
        },
    }
}
ray.tune.run_experiments(copy.deepcopy(experiment_params))

Hi @carlorop, with that configuration checkpoints are already being written every 10 training iterations (and once at the end of training) to a subdirectory of /opt/ml/output/intermediate.

You can then get the best checkpoint using, e.g.:

# Point ExperimentAnalysis at the experiment directory created under local_dir
analysis = ray.tune.ExperimentAnalysis("/opt/ml/output/intermediate/training")
best_checkpoint = analysis.get_best_checkpoint(
    analysis.trials[0], metric="episode_reward_mean", mode="max"
)

(note the directory name might be slightly different)
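If you then want to load that best checkpoint back into a trainer (e.g. for evaluation or to continue training), something like the sketch below should work. Note this assumes you trained with PPO, so swap in the trainer class matching your args.algorithm, and that your custom "wf" env is registered in this process the same way it was for training:

from ray.rllib.agents.ppo import PPOTrainer  # swap for the trainer class of args.algorithm

# Rebuild a trainer with the same env/config used for training,
# then load the weights from the best checkpoint found above.
agent = PPOTrainer(env="wf", config={"num_workers": 0})
agent.restore(best_checkpoint)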

See also Analysis (tune.analysis) — Ray 1.11.0
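One more option: if you want Tune itself to keep only the highest-scoring checkpoints on disk during training, the experiment spec should also accept keep_checkpoints_num and checkpoint_score_attr (worth double-checking against your Ray version), e.g.:

experiment_params = {
    "training": {
        # ... same settings as in your spec above ...
        "checkpoint_freq": 10,
        "checkpoint_at_end": True,
        # Keep only the 5 checkpoints with the best episode_reward_mean
        "keep_checkpoints_num": 5,
        "checkpoint_score_attr": "episode_reward_mean",
    }
}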
