Nested checkpoint directories

When generating session reports and saving checkpoints, the checkpoint directories get nested inside each other, e.g. exp/checkpoint_01/checkpoint_02/… As the number of saves increases, this nesting produces excessively long paths. What am I doing wrong?

code for saving:

dir = session.get_trial_dir()
checkpoint = Checkpoint(local_path=dir)
session.report(metrics, checkpoint=checkpoint)

Hi @0piero,

A few things:

  1. You should not checkpoint the trial directory. The trial directory is already available to you, so you should only checkpoint training-related state that you want to save from within your training loop. Checkpoints are written inside the trial directory, which is what's causing the nesting.

    Tune’s directory structure:

    experiment_dir/
        trial_dir/
            checkpoint_0/
            checkpoint_1/
            ...
    
  2. You should use one of the official constructors of Checkpoint. Constructing a Checkpoint directly is a private API; in this case, use Checkpoint.from_directory instead. See the docs below:

https://docs.ray.io/en/latest/ray-air/api/doc/ray.air.checkpoint.Checkpoint.from_directory.html
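To make point 1 concrete, here is a hedged, pure-Python simulation (no Ray required) of why checkpointing the trial directory nests: on each report, the checkpoint's source directory is copied into the trial directory as checkpoint_N, so if the source *is* the trial directory, every earlier checkpoint gets copied back in. The `report` helper below is hypothetical and only mimics that copy step.

```python
import os
import shutil
import tempfile

def report(trial_dir, checkpoint_src, step):
    # Simulate Tune persisting a checkpoint: the checkpoint's source
    # directory is copied into the trial directory as checkpoint_<step>.
    shutil.copytree(checkpoint_src, os.path.join(trial_dir, f"checkpoint_{step:02d}"))

# Wrong: checkpointing the trial directory itself.
nested_trial_dir = tempfile.mkdtemp()
report(nested_trial_dir, nested_trial_dir, 1)
report(nested_trial_dir, nested_trial_dir, 2)  # checkpoint_02 now contains checkpoint_01
print(os.path.isdir(os.path.join(nested_trial_dir, "checkpoint_02", "checkpoint_01")))  # True

# Right: write state to a fresh temporary directory each time.
clean_trial_dir = tempfile.mkdtemp()
for step in (1, 2):
    src = tempfile.mkdtemp()
    with open(os.path.join(src, "model.pt"), "w") as f:
        f.write("state")
    report(clean_trial_dir, src, step)
print(sorted(os.listdir(clean_trial_dir)))  # ['checkpoint_01', 'checkpoint_02']
```

With the fresh-directory pattern, the trial directory stays flat no matter how many checkpoints you save.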

Here’s an example:

import os
import tempfile

import torch

from ray import tune
from ray.air import session
from ray.air.checkpoint import Checkpoint

def train_fn(config):
    # ...
    tmpdir = tempfile.mkdtemp()
    # Open in binary mode: torch.save writes bytes
    with open(os.path.join(tmpdir, "model.pt"), "wb") as f:
        torch.save(..., f)  # Save some training state to this directory
    checkpoint = Checkpoint.from_directory(tmpdir)
    session.report(..., checkpoint=checkpoint)

tuner = tune.Tuner(train_fn)
tuner.fit()