Tune Performance Bottlenecks

The tune.run call looks like this:

analysis = tune.run(
    tune.with_parameters(
        models.trainers.train_ptl_checkpoint,
        checkpoint_dir=model_config["checkpoint_dir"],  # None
        model_config=model_config,  # model-specific parameters
        num_epochs=num_epochs,
        num_gpus=gpus_per_trial,
        report_on=report_on,  # reporting frequency
        checkpoint_on=report_on,  # checkpointing frequency, if different from the reporting frequency
    ),
    resources_per_trial={"cpu": cpus_per_trial, "gpu": gpus_per_trial},
    metric=model_config["metric"],
    mode=model_config["mode"],
    config=tune_config,  # hyperparameters only
    num_samples=num_samples,  # 10
    scheduler=scheduler,  # optional trial scheduler
    progress_reporter=reporter,
    name=model_config["experiment_name"],
    sync_config=sync_config,  # Docker sync config plus uploading to cloud storage
    queue_trials=queue_trials,  # True for distributed runs
    fail_fast=True,
)
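
For reference, a minimal sketch of what the sync_config above might look like for a Docker cluster that also uploads results to cloud storage (the bucket URL and the use of DockerSyncer are assumptions, not the actual setup):

from ray import tune
from ray.tune.integration.docker import DockerSyncer

# Assumed sync setup: sync trial directories between Docker containers on
# the cluster and upload checkpoints/results to a cloud bucket.
sync_config = tune.SyncConfig(
    upload_dir="s3://my-tune-bucket/experiments",  # hypothetical bucket
    sync_to_driver=DockerSyncer,
)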

These are the callbacks used inside train_ptl_checkpoint:

from ray.tune.integration.pytorch_lightning import (
    TuneReportCallback,
    TuneReportCheckpointCallback,
)

# Set up the Tune report/checkpoint callbacks.
if report_on == checkpoint_on:
    callbacks = [
        TuneReportCheckpointCallback(
            metrics=model_config["metrics"],
            filename="checkpoint",
            on=checkpoint_on,
        )
    ]
else:
    callbacks = [
        TuneReportCallback(metrics=model_config["metrics"], on=report_on),
        TuneReportCheckpointCallback(
            metrics=model_config["metrics"],
            filename="checkpoint",
            on=checkpoint_on,
        ),
    ]
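
For context, a rough sketch of how those callbacks are then handed to the Lightning Trainer inside train_ptl_checkpoint (the exact Trainer arguments here are assumptions, not the real code):

import pytorch_lightning as pl

# Assumed wiring: the Tune callbacks above are attached to the Lightning
# Trainer so metrics and checkpoints are reported back to Tune.
trainer = pl.Trainer(
    max_epochs=num_epochs,
    gpus=num_gpus,
    callbacks=callbacks,
    progress_bar_refresh_rate=0,  # keep per-trial console output quiet
)
trainer.fit(model)  # "model" is the LightningModule built from model_config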

I will try to share a full or similar working example if possible.
When running locally, I get the following output after every training iteration:

2021-01-22 13:10:04,241 WARNING util.py:143 -- The process_trial_save operation took 97.929 s, which may be a performance bottleneck.

The model itself is quite large compared to the MNIST example, so I was thinking the time to save a checkpoint could be significant, and that this gets even worse when checkpoints are transferred from the workers to the head node?
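
To check whether serialization itself is the slow part, a quick sanity check (sketch only; the path and the model variable are placeholders) would be to time a bare torch.save of the state_dict outside of Tune and compare it against the ~98 s reported for process_trial_save:

import os
import time

import torch

# Time a plain torch.save of the model weights to local disk and note the
# resulting file size; if this alone takes tens of seconds, checkpoint
# serialization is likely the bottleneck before any node-to-node transfer.
start = time.perf_counter()
torch.save(model.state_dict(), "/tmp/checkpoint_size_test.pt")
elapsed = time.perf_counter() - start
size_mb = os.path.getsize("/tmp/checkpoint_size_test.pt") / 1e6
print(f"state_dict save: {elapsed:.1f} s, {size_mb:.0f} MB on disk")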