Performance bottleneck when saving the model during training

WARNING util.py:244 – The process_trial_save operation took 14.823 s, which may be a performance bottleneck.

I’m running into this because the training loop saves the model after every iteration, even though I set keep_checkpoints_num = 1 in tune.run. I’m using the function API, and the relevant checkpointing code is as follows:

metrics = {"loss": val_loss, "running_loss": running_loss}

            checkpoint = Checkpoint(local_path=session.get_trial_dir())
            with checkpoint.as_directory() as checkpoint_dir:
                print("Model saving")
                for id, (model, opt) in enumerate(zip(model_type, optimizer)):
                    torch.save((model, opt.state_dict()), os.path.join(checkpoint_dir, "checkpoint" + str(id)))
                print("End saving")
                session.report({"loss": val_loss, "running_loss": running_loss}, checkpoint=checkpoint)

Hi @0piero,

keep_checkpoints_num only sets how many checkpoints should be kept on disk when a new checkpoint is written - it does not affect how often checkpoints are saved.
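
For reference, that setting lives in the tune.run call itself; here is a minimal sketch, assuming your training function is called train_fn and your metric is reported as "loss" (both placeholders, and parameter names can differ slightly between Ray versions):

from ray import tune

analysis = tune.run(
    train_fn,                          # your training function (placeholder name)
    keep_checkpoints_num=1,            # only prunes old checkpoints on disk
    checkpoint_score_attr="min-loss",  # which metric decides what counts as "best" when pruning
)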

In fact, it is your code that decides how often checkpoints are written. So if you want to write them less often, you can do something like this:

for epoch in range(num_epochs):
    # ...
    metrics = {"loss": val_loss, "running_loss": running_loss}

    if epoch % 10 == 0:
        checkpoint = Checkpoint(local_path=session.get_trial_dir())
        # ... save the model files into the checkpoint directory ...
    else:
        checkpoint = None

    # report every epoch; a checkpoint is only attached every 10th epoch
    session.report({"loss": val_loss, "running_loss": running_loss}, checkpoint=checkpoint)

This would save a checkpoint every 10 epochs, while still reporting metrics on every epoch.

Thanks for the response. I tried it this way, but with this approach it’s not possible to recover the best trial results and model parameters. Using the old API ("with tune.checkpoint_dir") to store the model and save checkpoints during training did not give me this performance bottleneck. I thought it would be possible to only save the model parameters when the chosen metric for an iteration is better than in previous ones, but I can’t figure out how to use this functionality. Is it possible?

Also, when I use this approach, the “The process_trial_save operation took X s, which may be a performance bottleneck” messages keep appearing.

How large are your checkpoints? On how many nodes are you running?

process_trial_save synchronizes checkpoints between the worker nodes and the head node. If you have very large checkpoints this can take time.

One possible mitigation would be to use cloud storage (e.g. S3) and configure it in your SyncConfig. Another option would be to disable syncing by passing SyncConfig(syncer=None) to the Tuner.
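
As a rough sketch of those two options (the bucket path and train_fn are placeholders, and exact parameter names can differ between Ray versions):

from ray import air, tune

# Option 1: sync checkpoints to cloud storage instead of the head node
sync_config = tune.SyncConfig(upload_dir="s3://my-bucket/tune-results")  # placeholder bucket

# Option 2: disable syncing entirely (checkpoints then stay on the worker nodes)
# sync_config = tune.SyncConfig(syncer=None)

tuner = tune.Tuner(
    train_fn,  # your training function
    run_config=air.RunConfig(sync_config=sync_config),
)
results = tuner.fit()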

For more help it would be good if you could share more information, e.g. the number of nodes, checkpoint size, model type, and the code where you call Tuner.fit.

Btw, you can still use the old API (tune.checkpoint_dir and tune.report), but it should be equivalent to the new API in terms of performance.
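
Regarding checkpointing only when the metric improves: one way is to track the best value seen so far inside your training function and only attach a checkpoint when the new value beats it. A minimal sketch, reusing the variable names from your snippet and writing into a placeholder directory "my_model":

import os
import torch
from ray.air import session
from ray.air.checkpoint import Checkpoint

best_loss = float("inf")

for epoch in range(num_epochs):
    # ... training / validation that produces val_loss and running_loss ...
    metrics = {"loss": val_loss, "running_loss": running_loss}

    checkpoint = None
    if val_loss < best_loss:  # only write a checkpoint on improvement
        best_loss = val_loss
        os.makedirs("my_model", exist_ok=True)
        for i, (model, opt) in enumerate(zip(model_type, optimizer)):
            torch.save((model, opt.state_dict()), os.path.join("my_model", f"checkpoint{i}"))
        checkpoint = Checkpoint.from_directory("my_model")

    # report every epoch; the checkpoint is only attached when the metric improved
    session.report(metrics, checkpoint=checkpoint)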