Trial checkpointing

Jaydeen · June 16, 2023, 2:05pm

Hi!
I am in the process of updating my training code from Ray 1.13 to Ray 2.5. Unfortunately, I have trouble understanding the updated checkpointing mechanism. To create checkpoints, I used to do:

with tune.checkpoint_dir(step=epoch) as checkpoint_dir:
      # save everything I wanted manually here

This would save all the data I wanted in a directory called “checkpoint_[epoch]” in my trial's folder. Ray would also register the checkpoint automatically.
To load my checkpoint, I would do something like this:

import os
import torch

def my_trainable(config, checkpoint_dir=None):
    if checkpoint_dir is not None:
        model_path = os.path.join(checkpoint_dir, "model.pt")
        # map_location is an argument of torch.load, not of load_state_dict
        model_state = torch.load(model_path, map_location=device)
        start_epoch_num = last_epoch_num + 1  # last_epoch_num read from folder
        net.load_state_dict(model_state)
        net.to(device)
        ....

Basically, I am trying to achieve the same behavior with the new API. But:

If I use “Checkpoint a dictionary”, I would need to put everything I want to save into a single Python dictionary, which is then saved as one object in the “checkpoint_[epoch]” folder, correct? I find this tedious, since I would later always need to load the whole dict even when I only want to read partial results.
It seems I want to use “Checkpoint a directory” instead. However, why do I have to write into a “manual folder” that Ray will later copy the results from? When can I clean up my “manual folder”, and do I need to make sure that it's not named the same for simultaneously running trials?

Thank you!
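
For illustration, here is a minimal sketch of the “manual folder” pattern the second question is about, written with the standard library only. It assumes the reporting call copies the directory before it returns, which should be verified for the Ray version in use. The `report` callable is a hypothetical stand-in for that call (in Ray 2.5 this would presumably be `session.report(metrics, checkpoint=Checkpoint.from_directory(tmp_dir))`), and `save_checkpoint`, `load_epoch` and `make_copying_report` are illustrative names, not Ray APIs. JSON files stand in for `torch.save`.

```python
import json
import os
import shutil
import tempfile


def save_checkpoint(epoch, state, report):
    """Write one checkpoint into a fresh temp dir and hand it to `report`.

    `report(metrics, checkpoint_dir)` is a hypothetical stand-in for the
    framework's reporting call; it is assumed to copy the directory
    before returning.
    """
    # A new, uniquely named directory per call, so simultaneously running
    # trials cannot collide on the folder name.
    with tempfile.TemporaryDirectory(prefix="ckpt_") as tmp_dir:
        # Separate files allow reading partial results later.
        with open(os.path.join(tmp_dir, "state.json"), "w") as f:
            json.dump(state, f)
        with open(os.path.join(tmp_dir, "epoch.txt"), "w") as f:
            f.write(str(epoch))
        report({"epoch": epoch}, tmp_dir)
        # Leaving the `with` block deletes tmp_dir: cleanup is automatic.
        return tmp_dir


def load_epoch(checkpoint_dir):
    """Read only the epoch number, without loading the rest."""
    with open(os.path.join(checkpoint_dir, "epoch.txt")) as f:
        return int(f.read())


def make_copying_report(storage_root):
    """Fake reporter that persists the folder, mimicking the assumed copy."""
    def report(metrics, checkpoint_dir):
        name = f"checkpoint_{metrics['epoch']:06d}"
        shutil.copytree(checkpoint_dir, os.path.join(storage_root, name))
    return report
```

Under that assumption, the temporary folder can be removed as soon as the reporting call returns, and no naming scheme is needed to keep parallel trials apart, since `tempfile` picks a unique name each time.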
