Trial checkpointing

I am in the process of updating my training from Ray 1.13 to Ray 2.5. Unfortunately, I have trouble understanding the updated checkpointing mechanism. To create checkpoints, I used to do:

with tune.checkpoint_dir(step=epoch) as checkpoint_dir:
    # save everything I wanted manually here

This would save all the data I wanted in a directory called “checkpoint_[epoch]” in my trial's folder. Ray would also know about the checkpoint automatically.
To load my checkpoint, I would do something like this:

import os
import torch

def my_trainable(config, checkpoint_dir=None):
    if checkpoint_dir is not None:
        model_path = os.path.join(checkpoint_dir, "")
        # map_location is an argument of torch.load, not of load_state_dict
        model_state = torch.load(model_path, map_location=device)
        start_epoch_num = last_epoch_num + 1  # last_epoch_num read from folder
        net.load_state_dict(model_state)


Basically, I am trying to achieve the same behavior with the new API. But:

  1. If I use “Checkpoint a dictionary”, I would need to put everything I want to save into a single Python dictionary, which is then saved as one object in the “checkpoint_[epoch]” folder, correct? I find this tedious, since I would later always need to load the whole dict even when I only want to read partial results.
  2. It seems I want to use “Checkpoint a directory”. However, why do I have to write to a “manual folder” that Ray later copies the results from? When can I clean up my “manual folder”, and do I need to make sure that it is not named the same across simultaneously running trials?
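
To make the two points concrete, here is a minimal sketch of what I imagine the “manual folder” step could look like, using only the standard library. The helper names (write_manual_checkpoint, read_epoch) and the file names (meta.json, state.pkl) are placeholders I made up, not part of Ray, and the pickled dict only stands in for the real model state. Handing the folder over to Ray is deliberately left out, since that is the part I am unsure about.

```python
import json
import pickle
import shutil
import tempfile
from pathlib import Path


def write_manual_checkpoint(epoch, state):
    """Write one checkpoint into a fresh, uniquely named temporary folder."""
    # mkdtemp creates a new directory with a unique name on every call,
    # so two trials writing at the same time get different folders.
    folder = Path(tempfile.mkdtemp(prefix=f"checkpoint_{epoch}_"))
    # Small metadata goes into its own file so it can be read on its own.
    (folder / "meta.json").write_text(json.dumps({"epoch": epoch}))
    # Placeholder for the real model state (hypothetical stand-in data).
    with open(folder / "state.pkl", "wb") as f:
        pickle.dump(state, f)
    return folder


def read_epoch(folder):
    """Read only the epoch number, without opening the larger state file."""
    return json.loads((Path(folder) / "meta.json").read_text())["epoch"]


if __name__ == "__main__":
    first = write_manual_checkpoint(3, {"weights": [0.1, 0.2]})
    second = write_manual_checkpoint(3, {"weights": [0.3, 0.4]})
    print(first, second)        # two different folders for the same epoch
    print(read_epoch(first))    # partial read: only meta.json is loaded
    # Manual cleanup; when this is safe relative to Ray's copy is my question.
    shutil.rmtree(first)
    shutil.rmtree(second)
```

With separate files per item, reading the epoch does not require loading the whole state, which is what I would like to keep from my old setup.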

Thank you!