Analysis get_best_checkpoint returning None

0piero · January 20, 2023, 10:16am

I’m having trouble retrieving a trial’s best checkpoint. This is the part of my training loop where I save a checkpoint and report my metrics:

metrics = {"loss": val_loss, "running_loss": running_loss}
with tune.checkpoint_dir(epoch) as checkpoint_dir:
    path = os.path.join(checkpoint_dir, "checkpoint")
    torch.save((model, optimizer.state_dict()), path)
    with open(path, "w") as f:
        f.write(json.dumps(metrics))
tune.report(metrics)

When I call result.get_best_checkpoint, it returns None. What am I doing wrong?

arturn · February 8, 2023, 8:13pm

Hi @0piero ,

Please refer to the example script for checkpointing by custom criteria: ray/checkpoint_by_custom_criteria.py at master · ray-project/ray · GitHub

If this does not help you, could you please post this as a reproducible script?
What you are doing is not 100% clear to me.

Thank you
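One thing that stands out in the posted snippet, independent of Ray: torch.save and the later open(path, "w") both write to the same path, so the JSON write truncates the file that was just saved. Below is a minimal sketch of that effect using only the standard library. pickle stands in for torch.save, and the helper names and the metrics.json filename are hypothetical, chosen for illustration. This may not be the reason get_best_checkpoint returns None, but it would leave the checkpoint without the model state.

```python
import json
import os
import pickle
import tempfile


def save_overwriting(checkpoint_dir, state, metrics):
    """Mirrors the posted snippet: both writes target the same file."""
    path = os.path.join(checkpoint_dir, "checkpoint")
    with open(path, "wb") as f:
        pickle.dump(state, f)  # stand-in for torch.save(...)
    with open(path, "w") as f:  # mode "w" truncates the file just written
        f.write(json.dumps(metrics))
    return path


def save_separately(checkpoint_dir, state, metrics):
    """Hypothetical alternative: one file for state, one for metrics."""
    state_path = os.path.join(checkpoint_dir, "checkpoint")
    metrics_path = os.path.join(checkpoint_dir, "metrics.json")  # assumed name
    with open(state_path, "wb") as f:
        pickle.dump(state, f)
    with open(metrics_path, "w") as f:
        json.dump(metrics, f)
    return state_path, metrics_path


state = {"weights": [0.1, 0.2]}
metrics = {"loss": 0.5, "running_loss": 0.7}

# Same path for both writes: only the JSON metrics survive.
with tempfile.TemporaryDirectory() as tmp:
    path = save_overwriting(tmp, state, metrics)
    with open(path) as f:
        overwritten = json.load(f)

# Separate paths: both the state and the metrics can be read back.
with tempfile.TemporaryDirectory() as tmp:
    state_path, metrics_path = save_separately(tmp, state, metrics)
    with open(state_path, "rb") as f:
        restored_state = pickle.load(f)
    with open(metrics_path) as f:
        restored_metrics = json.load(f)
```

It may also be worth checking how the metrics are reported. Assuming the legacy function API, where tune.report accepts metrics as keyword arguments, tune.report(**metrics) would expose "loss" as a top-level metric, whereas passing the dict positionally may not. I have not verified this against the Ray version used here.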
