Nested checkpoint directories

The problem I have is that when generating session reports and saving checkpoints, the directories are being nested. For example: exp/checkpoint_01/checkpoint_02/… This nested structure becomes problematic as the number of saves increases, resulting in excessively long directory names. What am I doing wrong?

code for saving:

dir = session.get_trial_dir()
checkpoint = Checkpoint(local_path=dir)
# `metrics` is a placeholder for the reported metrics dict
session.report(metrics, checkpoint=checkpoint)

Hi @0piero,

A few things:

  1. You should not checkpoint the trial directory. The trial directory is already available to you, so you should only checkpoint training-related state that you want to output from within your training loop. Checkpoints are generated within the trial directory, which is what’s causing this nesting problem.


  2. You should use one of the official constructors of Checkpoint. Creating a Checkpoint directly is a private API; you should use Checkpoint.from_directory in this case instead.

Here’s an example:

import os
import tempfile

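from ray import tune
from ray.air import Checkpoint, session

def train_fn(config):
    # ...
    tmpdir = tempfile.mkdtemp()
    # Save some file to this directory ("state.txt" and its contents are placeholders)
    with open(os.path.join(tmpdir, "state.txt"), "w") as f:
        f.write("...")
    checkpoint = Checkpoint.from_directory(tmpdir)
    # Report the checkpoint along with some metrics (placeholder dict)
    session.report({"step": 1}, checkpoint=checkpoint)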

tuner = tune.Tuner(train_fn)
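
If it helps to see why checkpointing the trial directory nests, here is a rough standard-library simulation. It does not use Ray and is not how Tune is implemented; save_checkpoint, max_depth, run, the checkpoint_NN naming, and state.txt are all made up for illustration. The only assumption carried over from the thread is that each checkpoint ends up as a subdirectory of the trial directory.

```python
import os
import shutil
import tempfile


def save_checkpoint(trial_dir, source_dir, index):
    """Copy source_dir into trial_dir/checkpoint_<index> (hypothetical layout)."""
    staging = tempfile.mkdtemp()
    snapshot = os.path.join(staging, "snapshot")
    # Snapshot first so the copy never recurses into its own destination.
    shutil.copytree(source_dir, snapshot)
    dest = os.path.join(trial_dir, f"checkpoint_{index:02d}")
    shutil.move(snapshot, dest)
    shutil.rmtree(staging)
    return dest


def max_depth(root):
    """Return the deepest directory level below root."""
    return max(dirpath[len(root):].count(os.sep) for dirpath, _, _ in os.walk(root))


def run(num_saves, checkpoint_trial_dir):
    """Save num_saves checkpoints and return the resulting directory depth."""
    trial_dir = tempfile.mkdtemp()
    for i in range(1, num_saves + 1):
        if checkpoint_trial_dir:
            # The mistake: the source already contains every earlier checkpoint.
            source = trial_dir
        else:
            # The fix: a fresh directory holding only the training state.
            source = tempfile.mkdtemp()
            with open(os.path.join(source, "state.txt"), "w") as f:
                f.write(f"step={i}")
        save_checkpoint(trial_dir, source, i)
        if not checkpoint_trial_dir:
            shutil.rmtree(source)
    depth = max_depth(trial_dir)
    shutil.rmtree(trial_dir)
    return depth


nested_depth = run(3, checkpoint_trial_dir=True)
flat_depth = run(3, checkpoint_trial_dir=False)
print(nested_depth, flat_depth)  # prints: 3 1
```

In this sketch, every save of the trial directory drags all previous checkpoints along, so the depth grows by one per save, while saving a fresh temporary directory keeps every checkpoint one level deep.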