I am in the process of updating my training code from Ray 1.13 to Ray 2.5. Unfortunately, I have trouble understanding the updated checkpointing mechanism. To create checkpoints, I used to do:
```python
with tune.checkpoint_dir(step=epoch) as checkpoint_dir:
    # save everything I wanted manually here
```
This would save all the data I wanted in a directory called “checkpoint_[epoch]” inside my trial folder, and Ray would know about the checkpoint automatically.
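Fleshed out, my old save code looked roughly like this (a sketch; `net` and `epoch` are stand-ins for my actual model and loop variable):

```python
import os

def save_checkpoint(net, epoch):
    # Old Ray 1.x pattern: tune.checkpoint_dir creates "checkpoint_<epoch>"
    # inside the trial folder and registers it with Ray automatically.
    from ray import tune
    import torch

    with tune.checkpoint_dir(step=epoch) as checkpoint_dir:
        torch.save(net.state_dict(), os.path.join(checkpoint_dir, "model.pt"))
```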
To load my checkpoint, I would do something like this:
```python
def my_trainable(config, checkpoint_dir=None):
    if checkpoint_dir is not None:
        model_path = os.path.join(checkpoint_dir, "model.pt")
        # map_location belongs on torch.load, not load_state_dict
        model_state = torch.load(model_path, map_location=device)
        start_epoch_num = last_epoch_num + 1  # last_epoch_num read from folder
        net.load_state_dict(model_state)
        net.to(device)
    ...
```
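For reference, my current understanding of the restore path in the new API is roughly the following (a sketch, assuming the `ray.air.session` API from Ray 2.x; `net` and `device` are the same stand-ins as above):

```python
import os

def my_trainable(config):
    # Ray 2.x sketch: the checkpoint now comes from the session instead of
    # being passed in as a checkpoint_dir argument.
    from ray.air import session
    import torch

    checkpoint = session.get_checkpoint()  # None on a fresh start
    if checkpoint is not None:
        # as_directory() materializes the checkpoint as a local directory
        with checkpoint.as_directory() as checkpoint_dir:
            model_path = os.path.join(checkpoint_dir, "model.pt")
            model_state = torch.load(model_path, map_location=device)
            net.load_state_dict(model_state)
            net.to(device)
```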
Basically, I am trying to achieve the same behavior with the new API. But:
- If I use “Checkpoint a dictionary”, I would need to put everything I want to save into a Python dictionary, which is then saved as a whole in the “checkpoint_[epoch]” folder, correct? I find this tedious, since I would later always need to load the whole dict even when I only want to read partial results.
- It seems I want to use “Checkpoint a directory”. However, why do I have to write to a “manual folder” from which Ray later copies the results? When can I clean up my “manual folder”, and do I need to make sure that it is not named the same for simultaneously running trials?
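To make my question about the “manual folder” concrete, here is a sketch of what I think the directory-based flow looks like. The `ray.air` calls are my assumption and appear only as comments; using a unique staging directory via `tempfile.mkdtemp` is how I would avoid name clashes between simultaneously running trials:

```python
import os
import shutil
import tempfile

# Each trial creates its own unique staging directory, so concurrently
# running trials on the same node cannot collide on a shared folder name.
staging_dir = tempfile.mkdtemp(prefix="ckpt_stage_")

# Write checkpoint files; a stand-in for torch.save(net.state_dict(), ...).
with open(os.path.join(staging_dir, "model.pt"), "wb") as f:
    f.write(b"fake model bytes")

# Presumably one would then hand the directory to Ray, which copies its
# contents into the trial folder (assumed API, not called here):
#   from ray.air import session
#   from ray.air.checkpoint import Checkpoint
#   session.report(metrics, checkpoint=Checkpoint.from_directory(staging_dir))

# Once Ray has copied the data (the call is skipped in this sketch), the
# staging directory can be removed.
shutil.rmtree(staging_dir)
```

Is that the intended lifecycle, i.e. the staging directory is safe to delete as soon as the report call returns?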