The problem I have is that when I report session results and save checkpoints, the checkpoint directories get nested. For example: exp/checkpoint_01/checkpoint_02/… This nesting becomes problematic as the number of saves grows, resulting in excessively long paths. What am I doing wrong?
code for saving:
from ray.air import session, Checkpoint

trial_dir = session.get_trial_dir()
checkpoint = Checkpoint(local_path=trial_dir)  # wraps the entire trial directory
session.report(metrics, checkpoint=checkpoint)
You should not checkpoint the trial directory. The trial directory is already available to you, so you should only checkpoint the training-related state that you want to output from within your training loop. Checkpoints are written inside the trial directory, so checkpointing the trial directory itself is what causes the nesting.
You should use one of the official constructors of Checkpoint. Directly instantiating Checkpoint is a private API; in this case, use Checkpoint.from_directory instead. See the docs below:
import os
import tempfile

import torch
from ray import tune
from ray.air import session, Checkpoint

def train_fn(config):
    # ...
    tmpdir = tempfile.mkdtemp()
    with open(os.path.join(tmpdir, "model.pt"), "wb") as f:
        torch.save(..., f)  # Save some file to this directory

    # Build the checkpoint from the temporary directory, not the trial dir
    checkpoint = Checkpoint.from_directory(tmpdir)
    session.report(..., checkpoint=checkpoint)

tuner = tune.Tuner(train_fn)
tuner.fit()
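For completeness, here is a minimal sketch of the other half of the pattern: restoring the reported state inside the training function when a trial is resumed. It assumes the same Ray AIR session API as the example above; the "model.pt" filename and the "epoch" field in the saved state are illustrative placeholders, not part of the original answer.

import os
import tempfile

import torch
from ray import tune
from ray.air import session, Checkpoint

def train_fn(config):
    # If Tune resumed this trial from a reported checkpoint, load state from it.
    start_epoch = 0
    restored = session.get_checkpoint()
    if restored:
        with restored.as_directory() as ckpt_dir:
            state = torch.load(os.path.join(ckpt_dir, "model.pt"))
            start_epoch = state["epoch"] + 1

    for epoch in range(start_epoch, 10):
        # ... one epoch of training ...
        tmpdir = tempfile.mkdtemp()  # fresh dir per checkpoint, never the trial dir
        torch.save({"epoch": epoch}, os.path.join(tmpdir, "model.pt"))
        session.report(
            {"epoch": epoch},
            checkpoint=Checkpoint.from_directory(tmpdir),
        )

tuner = tune.Tuner(train_fn)
tuner.fit()

Because every checkpoint is built from a fresh temporary directory rather than the trial directory, the trial directory only ever contains flat checkpoint folders and no nesting occurs.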