Tune Performance Bottlenecks

Hello, I'm wondering if someone can offer me some advice on this. I'm using a PyTorch Lightning module similar to the MNIST example, but I keep running into performance bottlenecks:

2021-01-20 12:44:50,088 WARNING util.py:150 -- The `callbacks.on_trial_result` operation took 111.168 s, which may be a performance bottleneck.
2021-01-20 12:44:50,090 WARNING util.py:150 -- The `process_trial_result` operation took 111.170 s, which may be a performance bottleneck.
2021-01-20 12:44:50,090 WARNING util.py:150 -- Processing trial results took 111.170 s, which may be a performance bottleneck. Please consider reporting results less frequently to Ray Tune.
2021-01-20 12:44:50,090 WARNING util.py:150 -- The `process_trial` operation took 111.170 s, which may be a performance bottleneck.

I saw speed-ups after reducing the config to tunable hyperparameters only. I am using tune.with_parameters.

I also tried setting the reporting frequency to 500s, but I'm still getting the same issue.

My guess is that Tune is spending a lot of time checkpointing on every validation loop?

Would creating a callback that only saves a checkpoint when performance improves on a metric be a viable solution? (A rough sketch of what I mean is below.)

In addition, because checkpoints are so frequent and large, syncing them from the worker nodes to the head node likely takes a long time?
When I turn off syncing on checkpoint, things go awry and the experiment randomly breaks.
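
For context, the callback I have in mind is roughly the sketch below. This isn't my actual code: the BestMetricCheckpointCallback name and the val_loss metric are placeholders, and it assumes Tune's function-API checkpointing (tune.checkpoint_dir) is available inside the training function.

import math
import os

import pytorch_lightning as pl
import torch
from ray import tune


class BestMetricCheckpointCallback(pl.Callback):
    """Only write a Tune checkpoint when the monitored metric improves."""

    def __init__(self, monitor="val_loss", mode="min"):
        self.monitor = monitor
        self.mode = mode
        self.best = math.inf if mode == "min" else -math.inf

    def on_validation_end(self, trainer, pl_module):
        value = trainer.callback_metrics[self.monitor].item()
        improved = value < self.best if self.mode == "min" else value > self.best
        if improved:
            self.best = value
            # Save into a Tune-managed checkpoint directory only on improvement.
            with tune.checkpoint_dir(step=trainer.current_epoch) as checkpoint_dir:
                torch.save(pl_module.state_dict(), os.path.join(checkpoint_dir, "checkpoint"))
        # Still report the metric on every validation so schedulers see all results.
        tune.report(**{self.monitor: value})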

Hmm, what does your tune.run call look like? If you have a full example, I’d love to take a look at it/run it!

I think checkpointing might not be a big issue because we actually don’t block on the checkpointing event. It seems like on_trial_result as defined by your callback is the bottleneck – would it be possible to post that too?

The tune.run call looks like this:

analysis = tune.run(
    tune.with_parameters(
        models.trainers.train_ptl_checkpoint,
        checkpoint_dir=model_config["checkpoint_dir"],  # None
        model_config=model_config,  # model-specific parameters
        num_epochs=num_epochs,
        num_gpus=gpus_per_trial,
        report_on=report_on,  # reporting frequency
        checkpoint_on=report_on,  # checkpointing frequency, if different from reporting frequency
    ),
    resources_per_trial={"cpu": cpus_per_trial, "gpu": gpus_per_trial},
    metric=model_config["metric"],
    mode=model_config["mode"],
    config=tune_config,  # hyperparameters only
    num_samples=num_samples,  # 10
    scheduler=scheduler,  # optional trial scheduler
    progress_reporter=reporter,
    name=model_config["experiment_name"],
    sync_config=sync_config,  # Docker sync config plus uploading to cloud storage
    queue_trials=queue_trials,  # True for distributed
    fail_fast=True,
)

The callbacks being used inside train_ptl_checkpoint:

from ray.tune.integration.pytorch_lightning import (
    TuneReportCallback,
    TuneReportCheckpointCallback,
)

# Set up the Tune report/checkpoint callbacks.
if report_on == checkpoint_on:
    # Reporting and checkpointing happen on the same hook: one combined callback.
    callbacks = [
        TuneReportCheckpointCallback(
            metrics=model_config["metrics"],
            filename="checkpoint",
            on=checkpoint_on,
        )
    ]
else:
    # Different hooks: report on one, checkpoint (which also reports) on the other.
    callbacks = [
        TuneReportCallback(metrics=model_config["metrics"], on=report_on),
        TuneReportCheckpointCallback(
            metrics=model_config["metrics"],
            filename="checkpoint",
            on=checkpoint_on,
        ),
    ]

I'll try to share a full example, or a similar working example, if possible.
When running locally I get the following output after every training iteration:

2021-01-22 13:10:04,241 WARNING util.py:143 -- The process_trial_save operation took 97.929 s, which may be a performance bottleneck.

The model itself is quite large compared to the MNIST example, so I was thinking the time to save it could be significant? That would only get worse when transferring checkpoints from the workers to the head node.

Hmm, so it seems like saving your model checkpoint is taking a long time. Is your model really large? Does checkpointing normally take a long time without Tune?

Hey Richard,

Apologies for the late follow-up.

  1. The model does take a long time to checkpoint without Tune. This gets a lot better as I scale up; the model is very large, so I guess it's expected.
  2. Testing with a similar but much smaller model solved these problems locally.

The easiest solution for now was to implement a top-K checkpointing method similar to the existing checkpointing callback. It's still relatively inefficient, though; I need to make it more similar to the checkpointing functionality in the Trainable API.

I can submit a PR for this if it'd be helpful!

Hey @Raed thanks for following up!

Does the keep_checkpoints_num parameter of tune.run work for you? It essentially implements a TopK thing for the checkpoints.
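
It's just an argument to tune.run, so roughly something like this (a trimmed-down sketch of your call with most arguments omitted, and keep_checkpoints_num=3 as an illustrative value):

analysis = tune.run(
    tune.with_parameters(models.trainers.train_ptl_checkpoint, model_config=model_config),
    metric=model_config["metric"],
    mode=model_config["mode"],
    config=tune_config,
    # Keep only the 3 best checkpoints per trial on disk...
    keep_checkpoints_num=3,
    # ...ranked by this result attribute; prefix it with "min-" for metrics you minimize.
    checkpoint_score_attr=model_config["metric"],
)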


Hello Richard,

That parameter did work for me; for some reason I thought it was only for the Trainable API.

Just an observation: I do get this warning message (even before setting the parameter):

2021-02-04 18:13:22,924 WARNING function_runner.py:541 -- Function checkpointing is disabled. This may result in unexpected behavior when using checkpointing features or certain schedulers. To enable, set the train function arguments to be func(config, checkpoint_dir=None).

However, I suspect this warning is spurious: I manually verified that checkpoints had been saved. My training function just has more parameters passed in (via tune.with_parameters) after checkpoint_dir=None.
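
For reference, the training function's signature looks roughly like this (parameter names other than config and checkpoint_dir are my own, and the defaults are just illustrative); the extra keyword arguments after checkpoint_dir are the ones filled in by tune.with_parameters above:

def train_ptl_checkpoint(
    config,                  # tunable hyperparameters sampled by Tune
    checkpoint_dir=None,     # the warning says this argument is what enables function checkpointing
    model_config=None,       # fixed, model-specific parameters
    num_epochs=10,
    num_gpus=0,
    report_on="validation_end",
    checkpoint_on="validation_end",
):
    # Build the LightningModule and Trainer here, attach the Tune callbacks
    # shown earlier, and call trainer.fit(...).
    ...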

Thank you for all your help!

Ah, OK! @Raed, could you post a quick issue about this warning on GitHub? I think this should be a fast fix on our side; we just need to track it 🙂

Created the issue: False Checkpoint Warning with tune.with_parameters() [tune] · Issue #13998 · ray-project/ray · GitHub