Saving best checkpoint - tune is saving first iterations instead

TheExGenesis · October 15, 2021, 11:19am

Hi all, I’m trying to checkpoint only the best iterations of my model, but when I check the trial directory, only the first 5 checkpoints (matching keep_checkpoints_num=5) and the last one have been saved:

checkpoint_010001  checkpoint_010003  checkpoint_010005  events.out.tfevents.1634291196.LAPTOP-7VGTS0VK  params.pkl    result.json
checkpoint_010002  checkpoint_010004  checkpoint_013663  params.json                                     progress.csv

My tune.run call:

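    from ray import tune
    from ray.tune.integration.wandb import WandbLoggerCallback
    from ray.tune.schedulers import AsyncHyperBandScheduler

    # args, config, stop and name_run are defined elsewhere in my script.
    scheduler = AsyncHyperBandScheduler(
        time_attr="training_iteration",
        grace_period=5 * 60,  # measured in units of time_attr
        max_t=1000000 * 60,
    )

    print("Training automatically with Ray Tune")
    analysis = tune.run(
        args.run,
        config=config,
        stop=stop,
        checkpoint_freq=1,  # checkpoint after every training iteration
        keep_checkpoints_num=5,  # number of checkpoints to retain
        checkpoint_score_attr="episode_reward_mean",  # attribute used to rank them
        metric="episode_reward_mean",
        mode="max",
        callbacks=[
            WandbLoggerCallback(
                group=name_run(config, ""),
                api_key_file=".wandb_api_key",
                project="egt-rl",
            ),
        ],
        scheduler=scheduler,
        name=name_run(config, ""),
    )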

Any idea why this is happening? The intended behavior is to keep the 5 best checkpoints by episode_reward_mean, plus the most recent one.
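
To make the intended retention rule concrete, here is a minimal sketch in plain Python. It is not Ray Tune's implementation; the function name `checkpoints_to_keep` and the reward values are made up for illustration, and skipping NaN scores is an assumption of this sketch.

```python
import heapq
import math


def checkpoints_to_keep(scores, k=5):
    """Return the iterations to retain: the k best by score, plus the latest.

    `scores` maps iteration number -> score (higher is better).
    NaN scores are never counted among the best (an assumption of this sketch).
    """
    if not scores:
        return []
    valid = {it: s for it, s in scores.items() if not math.isnan(s)}
    best = heapq.nlargest(k, valid, key=valid.get)
    latest = max(scores)  # the most recent iteration is always kept
    return sorted(set(best) | {latest})


# Hypothetical rewards that improve every iteration: the last five win.
improving = {i: float(i) for i in range(1, 11)}
print(checkpoints_to_keep(improving))  # [6, 7, 8, 9, 10]

# Hypothetical rewards that only get worse: the first five win, plus the latest.
declining = {i: float(-i) for i in range(1, 11)}
print(checkpoints_to_keep(declining))  # [1, 2, 3, 4, 5, 10]
```

Under this rule, a listing of "first five plus last" is what a run would produce if the reward never rose above its earliest values, so the listing alone does not show whether the ranking is being ignored.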

kai · October 18, 2021, 11:43am

What kind of trainable are you training (or which environment, if you are using RLlib)? Does your run converge, i.e. are you seeing higher rewards over time?
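
One way to answer the convergence question is to look at the progress.csv file in the trial directory. The following is a rough sketch using only the standard library; the helper name `top_iterations` is hypothetical, and it assumes the file has `training_iteration` and `episode_reward_mean` columns, which may not hold for every trainable.

```python
import csv
import math


def top_iterations(progress_csv, metric="episode_reward_mean", k=5):
    """Return the k (iteration, score) pairs with the highest metric value."""
    rows = []
    with open(progress_csv, newline="") as f:
        for row in csv.DictReader(f):
            try:
                score = float(row[metric])
                iteration = int(float(row["training_iteration"]))
            except (KeyError, TypeError, ValueError):
                continue  # column missing or value empty for this row
            if math.isnan(score):
                continue  # skip rows where the metric is NaN
            rows.append((iteration, score))
    # Highest score first; ties keep their original (chronological) order.
    return sorted(rows, key=lambda r: r[1], reverse=True)[:k]
```

Calling `top_iterations("progress.csv")` from inside the trial directory and comparing the returned iterations with the checkpoint_* directory names would show whether the retained checkpoints match the best-scoring iterations.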
