The tune.run call looks like this:
analysis = tune.run(
    tune.with_parameters(
        models.trainers.train_ptl_checkpoint,
        checkpoint_dir=model_config["checkpoint_dir"],  # None
        model_config=model_config,  # model-specific parameters
        num_epochs=num_epochs,
        num_gpus=gpus_per_trial,
        report_on=report_on,  # reporting frequency
        checkpoint_on=report_on,  # checkpointing frequency, if different from the reporting frequency
    ),
    resources_per_trial={"cpu": cpus_per_trial, "gpu": gpus_per_trial},
    metric=model_config["metric"],
    mode=model_config["mode"],
    config=tune_config,  # hyperparameters only
    num_samples=num_samples,  # 10
    scheduler=scheduler,  # optional trial scheduler
    progress_reporter=reporter,
    name=model_config["experiment_name"],
    sync_config=sync_config,  # Docker sync config plus uploading to cloud storage
    queue_trials=queue_trials,  # True for distributed
    fail_fast=True,
)
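For reference, tune_config holds only the hyperparameter search space. A minimal sketch of what it might contain (the parameter names and ranges below are placeholders, not the values actually used):

from ray import tune

# hypothetical search space -- names and ranges are illustrative only
tune_config = {
    "lr": tune.loguniform(1e-4, 1e-1),
    "batch_size": tune.choice([32, 64, 128]),
    "hidden_size": tune.choice([128, 256, 512]),
}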
These are the callbacks used inside train_ptl_checkpoint:
from ray.tune.integration.pytorch_lightning import (
    TuneReportCallback,
    TuneReportCheckpointCallback,
)

# set up Tune report callbacks
if report_on == checkpoint_on:
    callbacks = [
        TuneReportCheckpointCallback(
            metrics=model_config["metrics"],
            filename="checkpoint",
            on=checkpoint_on,
        )
    ]
else:
    callbacks = [
        TuneReportCallback(metrics=model_config["metrics"], on=report_on),
        TuneReportCheckpointCallback(
            metrics=model_config["metrics"],
            filename="checkpoint",
            on=checkpoint_on,
        ),
    ]
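For context, these callbacks end up on the Lightning Trainer inside train_ptl_checkpoint roughly as sketched below; the helper name, the MyLightningModule placeholder, and the exact Trainer arguments are assumptions, not the original code.

import pytorch_lightning as pl

def fit_with_tune_callbacks(config, model_config, num_epochs, num_gpus, callbacks):
    # callbacks is the list built above (TuneReportCallback / TuneReportCheckpointCallback)
    model = MyLightningModule(config, model_config)  # placeholder for the actual (large) model
    trainer = pl.Trainer(
        max_epochs=num_epochs,
        gpus=num_gpus,
        callbacks=callbacks,
        progress_bar_refresh_rate=0,  # keep Tune's console output readable
    )
    trainer.fit(model)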
I will try to share a full example, or a similar working example, if possible.
When running locally, I get the following output after every training iteration:
2021-01-22 13:10:04,241 WARNING util.py:143 -- The process_trial_save operation took 97.929 s, which may be a performance bottleneck.
The model itself is quite large compared to the MNIST example, so I was thinking the time to save the checkpoint could be significant, and that it gets worse when checkpoints are transferred from the workers to the head node?
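To check whether serialization itself accounts for most of those ~98 s, I can time a plain torch.save of the model's state dict outside of Tune. A minimal sketch, assuming model is the actual LightningModule:

import time
import torch

def time_checkpoint_save(model, path="/tmp/checkpoint_timing.pt"):
    # measure how long a single torch.save of the state dict takes
    start = time.perf_counter()
    torch.save(model.state_dict(), path)
    elapsed = time.perf_counter() - start
    print(f"torch.save took {elapsed:.1f} s")
    return elapsed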