Nested checkpoint directories

When generating session reports and saving checkpoints, the checkpoint directories get nested inside each other, e.g. exp/checkpoint_01/checkpoint_02/… As the number of saves increases, this nesting produces excessively long paths. What am I doing wrong?

code for saving:

dir = session.get_trial_dir()
checkpoint = Checkpoint(local_path=dir)
session.report(metrics, checkpoint=checkpoint)

Hi @0piero,

A few things:

  1. You should not checkpoint the trial directory. The trial directory is already available to you, so you should only checkpoint training-related state that you want to save from within your training loop. Checkpoints are written inside the trial directory, which is what's causing the nesting.

    Tune’s directory structure:

    experiment_dir/
        trial_dir/
            checkpoint_0/
            checkpoint_1/
            ...
    
  2. You should use one of the official constructors of Checkpoint. Constructing a Checkpoint directly is a private API; in this case, use Checkpoint.from_directory instead. See the docs below:

https://docs.ray.io/en/latest/ray-air/api/doc/ray.air.checkpoint.Checkpoint.from_directory.html
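To make point 1 concrete, here is a hedged, pure-Python simulation (no Ray required) of why checkpointing the trial directory nests: on each report, the checkpoint's source directory is copied into the trial directory as checkpoint_N, so if the source *is* the trial directory, every earlier checkpoint gets copied back in. The `report` helper below is hypothetical and only mimics that copy step.

```python
import os
import shutil
import tempfile

def report(trial_dir, checkpoint_src, step):
    # Simulate Tune persisting a checkpoint: the checkpoint's source
    # directory is copied into the trial directory as checkpoint_<step>.
    shutil.copytree(checkpoint_src, os.path.join(trial_dir, f"checkpoint_{step:02d}"))

# Wrong: checkpointing the trial directory itself.
nested_trial_dir = tempfile.mkdtemp()
report(nested_trial_dir, nested_trial_dir, 1)
report(nested_trial_dir, nested_trial_dir, 2)  # checkpoint_02 now contains checkpoint_01
print(os.path.isdir(os.path.join(nested_trial_dir, "checkpoint_02", "checkpoint_01")))  # True

# Right: write state to a fresh temporary directory each time.
clean_trial_dir = tempfile.mkdtemp()
for step in (1, 2):
    src = tempfile.mkdtemp()
    with open(os.path.join(src, "model.pt"), "w") as f:
        f.write("state")
    report(clean_trial_dir, src, step)
print(sorted(os.listdir(clean_trial_dir)))  # ['checkpoint_01', 'checkpoint_02']
```

With the fresh-directory pattern, the trial directory stays flat no matter how many checkpoints you save.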

Here’s an example:

import os
import tempfile

import torch

from ray import tune
from ray.air import session
from ray.air.checkpoint import Checkpoint

def train_fn(config):
    # ...
    tmpdir = tempfile.mkdtemp()
    # Open in binary mode: torch.save writes bytes
    with open(os.path.join(tmpdir, "model.pt"), "wb") as f:
        torch.save(..., f)  # Save some training state to this directory
    checkpoint = Checkpoint.from_directory(tmpdir)
    session.report(..., checkpoint=checkpoint)

tuner = tune.Tuner(train_fn)
tuner.fit()