Performance bottleneck when saving the model during training

WARNING util.py:244 – The process_trial_save operation took 14.823 s, which may be a performance bottleneck.

I’m running into this because the training loop saves the model after every iteration, even though I set keep_checkpoints_num = 1 in tune.run. I’m using the function API, and the relevant checkpointing code is as follows:

metrics = {"loss": val_loss, "running_loss": running_loss}

            checkpoint = Checkpoint(local_path=session.get_trial_dir())
            with checkpoint.as_directory() as checkpoint_dir:
                print("Model saving")
                for id, (model, opt) in enumerate(zip(model_type, optimizer)):
                    torch.save((model, opt.state_dict()), os.path.join(checkpoint_dir, "checkpoint" + str(id)))
                print("End saving")
                session.report({"loss": val_loss, "running_loss": running_loss}, checkpoint=checkpoint)

Hi @0piero,

keep_checkpoints_num only sets how many checkpoints should be kept on disk when a new checkpoint is written - it does not affect how often checkpoints are saved.
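
For reference, that setting lives in the tune.run call itself; here is a minimal sketch, assuming your training function is called train_fn and your metric is reported as "loss" (both placeholders, and parameter names can differ slightly between Ray versions):

from ray import tune

analysis = tune.run(
    train_fn,                          # your training function (placeholder name)
    keep_checkpoints_num=1,            # only prunes old checkpoints on disk
    checkpoint_score_attr="min-loss",  # which metric decides what counts as "best" when pruning
)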

In fact, it is your code that decides how often checkpoints are written. So if you want to write them less often, you can do something like this:

for epoch in range(num_epochs):
    # ...
    metrics = {"loss": val_loss, "running_loss": running_loss}

    if epoch % 10 == 0:
        checkpoint = Checkpoint(local_path=session.get_trial_dir())
        # ... save the model files into the checkpoint directory ...
    else:
        checkpoint = None

    # report every epoch; a checkpoint is only attached every 10th epoch
    session.report({"loss": val_loss, "running_loss": running_loss}, checkpoint=checkpoint)

This would save a checkpoint every 10 epochs, while still reporting metrics on every epoch.

Thanks for the response. I tried it this way, but with this approach it’s not possible to recover the best trial results and model parameters. Using the old API ("with tune.checkpoint_dir") to store the model and save checkpoints during training did not give me this performance bottleneck. I thought it would be possible to only save the model parameters when the chosen metric for an iteration is better than in previous ones, but I can’t figure out how to use this functionality. Is it possible?

Also, when I use this approach, the “The process_trial_save operation took X s, which may be a performance bottleneck” messages keep appearing.

How large are your checkpoints? On how many nodes are you running?

process_trial_save synchronizes checkpoints between the worker nodes and the head node. If you have very large checkpoints this can take time.

One possible mitigation would be to use cloud storage (e.g. S3) and configure it in your SyncConfig. Another option would be to disable syncing by passing SyncConfig(syncer=None) to the Tuner.
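
As a rough sketch of those two options (the bucket path and train_fn are placeholders, and exact parameter names can differ between Ray versions):

from ray import air, tune

# Option 1: sync checkpoints to cloud storage instead of the head node
sync_config = tune.SyncConfig(upload_dir="s3://my-bucket/tune-results")  # placeholder bucket

# Option 2: disable syncing entirely (checkpoints then stay on the worker nodes)
# sync_config = tune.SyncConfig(syncer=None)

tuner = tune.Tuner(
    train_fn,  # your training function
    run_config=air.RunConfig(sync_config=sync_config),
)
results = tuner.fit()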

For more help it would be good if you could share more information, e.g. the number of nodes, checkpoint size, model type, and the code where you call Tuner.fit.

Btw, you can still use the old API (tune.checkpoint_dir and tune.report), but it should be equivalent to the new API in terms of performance.
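
Regarding checkpointing only when the metric improves: one way is to track the best value seen so far inside your training function and only attach a checkpoint when the new value beats it. A minimal sketch, reusing the variable names from your snippet and writing into a placeholder directory "my_model":

import os
import torch
from ray.air import session
from ray.air.checkpoint import Checkpoint

best_loss = float("inf")

for epoch in range(num_epochs):
    # ... training / validation that produces val_loss and running_loss ...
    metrics = {"loss": val_loss, "running_loss": running_loss}

    checkpoint = None
    if val_loss < best_loss:  # only write a checkpoint on improvement
        best_loss = val_loss
        os.makedirs("my_model", exist_ok=True)
        for i, (model, opt) in enumerate(zip(model_type, optimizer)):
            torch.save((model, opt.state_dict()), os.path.join("my_model", f"checkpoint{i}"))
        checkpoint = Checkpoint.from_directory("my_model")

    # report every epoch; the checkpoint is only attached when the metric improved
    session.report(metrics, checkpoint=checkpoint)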