Hi all, I’m trying to checkpoint only the best iterations of my model, but when I check, only the first 5 checkpoints (because of keep_checkpoints_num=5) and the last one are saved, like so:
checkpoint_010001 checkpoint_010003 checkpoint_010005 events.out.tfevents.1634291196.LAPTOP-7VGTS0VK params.pkl result.json
checkpoint_010002 checkpoint_010004 checkpoint_013663 params.json progress.csv
My tune.run call:
scheduler = AsyncHyperBandScheduler(
    time_attr="training_iteration",
    grace_period=5 * 60,
    max_t=1000000 * 60,
)
print("Training automatically with Ray Tune")
analysis = tune.run(
    args.run,
    config=config,
    stop=stop,
    checkpoint_freq=1,
    keep_checkpoints_num=5,
    checkpoint_score_attr="episode_reward_mean",
    metric="episode_reward_mean",
    mode="max",
    callbacks=[
        WandbLoggerCallback(
            group=name_run(config, ""),
            api_key_file=".wandb_api_key",
            project="egt-rl",
        ),
    ],
    scheduler=scheduler,
    name=name_run(config, ""),
)
Any idea why this is happening? The intended behavior is to keep the 5 best checkpoints ranked by episode_reward_mean, plus the most recent one.
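To make the expected result concrete, here is a minimal sketch of the retention policy I have in mind. It is plain Python and not Ray's actual checkpoint manager; the function name `checkpoints_to_keep` and the `scores` dict (training iteration mapped to episode_reward_mean) are made up for illustration.

```python
import heapq


def checkpoints_to_keep(scores, k=5):
    """Return the iterations to keep: the k best by score, plus the latest.

    `scores` is a hypothetical dict mapping training iteration to
    episode_reward_mean; higher is better (mode="max").
    """
    if not scores:
        return set()
    # The k iterations with the highest score.
    best = heapq.nlargest(k, scores, key=scores.get)
    # The most recent iteration is always kept as well.
    latest = max(scores)
    return set(best) | {latest}


# Made-up rewards for eight iterations.
scores = {1: 10.0, 2: 12.0, 3: 9.0, 4: 15.0, 5: 11.0, 6: 20.0, 7: 8.0, 8: 7.0}
print(sorted(checkpoints_to_keep(scores, k=5)))  # [1, 2, 4, 5, 6, 8]
```

With these numbers, iterations 3 and 7 would be dropped and iteration 8 would survive only because it is the latest. That is the behavior I expected from the settings above, rather than the first five checkpoints being kept.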