Tune Performance Bottlenecks

The tune.run call looks like this:

analysis = tune.run(
    tune.with_parameters(
        models.trainers.train_ptl_checkpoint,
        checkpoint_dir=model_config["checkpoint_dir"],  # None
        model_config=model_config,  # model-specific parameters
        num_epochs=num_epochs,
        num_gpus=gpus_per_trial,
        report_on=report_on,  # reporting frequency
        checkpoint_on=report_on,  # checkpointing frequency, if different from the reporting frequency
    ),
    resources_per_trial={"cpu": cpus_per_trial, "gpu": gpus_per_trial},
    metric=model_config["metric"],
    mode=model_config["mode"],
    config=tune_config,  # hyperparameters only
    num_samples=num_samples,  # 10
    scheduler=scheduler,  # optional trial scheduler
    progress_reporter=reporter,
    name=model_config["experiment_name"],
    sync_config=sync_config,  # Docker sync config plus uploading to cloud storage
    queue_trials=queue_trials,  # True for distributed runs
    fail_fast=True,
)
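
For reference, a minimal sketch of what the sync_config above might look like for a Docker cluster that also uploads results to cloud storage (the bucket URL and the use of DockerSyncer are assumptions, not the actual setup):

from ray import tune
from ray.tune.integration.docker import DockerSyncer

# Assumed sync setup: sync trial directories between Docker containers on
# the cluster and upload checkpoints/results to a cloud bucket.
sync_config = tune.SyncConfig(
    upload_dir="s3://my-tune-bucket/experiments",  # hypothetical bucket
    sync_to_driver=DockerSyncer,
)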

These are the callbacks used inside train_ptl_checkpoint:

from ray.tune.integration.pytorch_lightning import (
    TuneReportCallback,
    TuneReportCheckpointCallback,
)

# Set up the Tune report/checkpoint callbacks.
if report_on == checkpoint_on:
    callbacks = [
        TuneReportCheckpointCallback(
            metrics=model_config["metrics"],
            filename="checkpoint",
            on=checkpoint_on,
        )
    ]
else:
    callbacks = [
        TuneReportCallback(metrics=model_config["metrics"], on=report_on),
        TuneReportCheckpointCallback(
            metrics=model_config["metrics"],
            filename="checkpoint",
            on=checkpoint_on,
        ),
    ]
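
For context, a rough sketch of how those callbacks are then handed to the Lightning Trainer inside train_ptl_checkpoint (the exact Trainer arguments here are assumptions, not the real code):

import pytorch_lightning as pl

# Assumed wiring: the Tune callbacks above are attached to the Lightning
# Trainer so metrics and checkpoints are reported back to Tune.
trainer = pl.Trainer(
    max_epochs=num_epochs,
    gpus=num_gpus,
    callbacks=callbacks,
    progress_bar_refresh_rate=0,  # keep per-trial console output quiet
)
trainer.fit(model)  # "model" is the LightningModule built from model_config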

I will try to share a full or similar working example if possible.
When running locally, I get the following output after every training iteration:

2021-01-22 13:10:04,241 WARNING util.py:143 -- The process_trial_save operation took 97.929 s, which may be a performance bottleneck.

The model itself is quite large compared to the MNIST example, so I was thinking the time to save a checkpoint could be significant, and that this gets even worse when checkpoints are transferred from the workers to the head node?
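
To check whether serialization itself is the slow part, a quick sanity check (sketch only; the path and the model variable are placeholders) would be to time a bare torch.save of the state_dict outside of Tune and compare it against the ~98 s reported for process_trial_save:

import os
import time

import torch

# Time a plain torch.save of the model weights to local disk and note the
# resulting file size; if this alone takes tens of seconds, checkpoint
# serialization is likely the bottleneck before any node-to-node transfer.
start = time.perf_counter()
torch.save(model.state_dict(), "/tmp/checkpoint_size_test.pt")
elapsed = time.perf_counter() - start
size_mb = os.path.getsize("/tmp/checkpoint_size_test.pt") / 1e6
print(f"state_dict save: {elapsed:.1f} s, {size_mb:.0f} MB on disk")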