Saving the best model at the end of training

Hi guys, I am trying to run a hyperparameter search using tune.with_parameters, because my data is too big. My training function takes the parameters config, data, and checkpoint_dir, and I am currently saving the model after every trial. Is there a way, using tune.run and tune.with_parameters, to save only the best model to disk after all the trials have run? Can I achieve this with checkpointing? My first idea was to use the Trainable class as in this example: [tune] How to checkpoint best model · Issue #10290 · ray-project/ray · GitHub, but as far as I understand I can't mix the Trainable class with tune.with_parameters. Moreover, when I tried, it didn't work.

My current solution (code snippet), where config holds the pipeline hyperparameters (for transformers and models):

from ray import tune


def train_model(config, data, checkpoint_dir=None):
    (train_data, y_train, dev_data, y_valid) = data
    model = ModelPipeline.from_config(config)

    model.fit(train_data, y_train, validation_data=(dev_data, y_valid))
    dev_metric_results = Evaluation(metrics=['custom_metrics']) \
        .evaluate(model=model, X=dev_data, y_true=y_valid)

    # Save the fitted pipeline into a Tune-managed checkpoint directory
    # so the trial has a checkpoint associated with the reported metric.
    with tune.checkpoint_dir(step=0) as checkpoint_dir:
        model.save(checkpoint_dir)

    # Report the validation metric that tune.run optimizes over.
    tune.report(custom_metrics=dev_metric_results['custom_metrics'])
        
analysis = tune.run(
    tune.with_parameters(train_model, data=data),
    name=name,
    config=config,
    num_samples=num_samples,
    time_budget_s=time_budget,
    verbose=verbose,
    resources_per_trial=resources,
    metric='custom_metrics',
    mode='max',
    keep_checkpoints_num=1,
    checkpoint_freq=1,
    checkpoint_score_attr='custom_metrics')

Hi, you can access the checkpoint of the best-performing trial like this:

best_checkpoint_dir = analysis.best_checkpoint

For more information you can take a look here: Analysis (tune.analysis) — Ray v1.1.0
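
If the goal is to end up with just one saved model on disk, that path can be post-processed once tune.run returns, e.g. copied out of the experiment directory. A minimal sketch (the best_model target directory and the ModelPipeline.load counterpart of model.save are assumptions, not part of the Tune API):

import shutil

# analysis is the object returned by tune.run; best_checkpoint is the
# checkpoint directory of the best-performing trial.
shutil.copytree(analysis.best_checkpoint, "best_model")

# The copy can then be reloaded with whatever counterpart of model.save
# the pipeline provides, e.g.:
# model = ModelPipeline.load("best_model")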

Hi, thanks for the answer. My question isn't about finding the path of the best checkpoint (which would still require checkpointing the model after each trial), but about having only one model (the best one) saved on disk once all the trials are done.

I am afraid this is not currently supported. There is at least one checkpoint per trial.

Would it work if you ran a separate, customized process that monitors trial results and proactively deletes the checkpoints of less-performant trials?
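
For the time being, a rough cleanup pass after tune.run returns could approximate that. A minimal sketch (not a supported API; it assumes the checkpoint_* directory naming that Tune uses inside each trial logdir, and that analysis is the object returned by tune.run):

import shutil
from pathlib import Path

# Keep only the best trial's checkpoints; delete those of every other trial.
best_trial = analysis.best_trial

for trial in analysis.trials:
    if trial is best_trial:
        continue
    for ckpt_dir in Path(trial.logdir).glob("checkpoint_*"):
        shutil.rmtree(ckpt_dir, ignore_errors=True)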

In the long run, it may be helpful for Tune to provide some API to interact with an ongoing experiment (deleting the checkpoints of less-performant trials could be implemented on top of such an API).

Maybe it is not supported, but in my case I save the model according to F1 using CheckpointConfig:

from ray import tune
from ray.train import RunConfig, CheckpointConfig

tuner = tune.Tuner(
    trainable_with_resources,
    param_space=search_space,
    tune_config=tune.TuneConfig(
        num_samples=1,
        mode='max',
        metric='eval_seq_f1',
        # scheduler=scheduler,
    ),
    run_config=RunConfig(
        name="tune_transformer_pbt",
        storage_path='/data-gpu/trungct/tmp',
        log_to_file=True,
        progress_reporter=reporter,
        # Keep only the two checkpoints with the highest eval_seq_f1.
        checkpoint_config=CheckpointConfig(
            num_to_keep=2,
            checkpoint_score_attribute="eval_seq_f1",
            checkpoint_score_order='max',
        ),
    ),
)

And then the best checkpoint paths (up to num_to_keep of them) are available at:

tuner = Tuner(...)
results = tuner.fit()
results.get_best_result(scope='all').best_checkpoints
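
If only the single top checkpoint is needed, it can be picked out of that list, since best_checkpoints pairs each retained Checkpoint with the metrics that were reported when it was saved. A rough sketch (it assumes eval_seq_f1 is present in those metric dicts):

best_result = results.get_best_result(scope='all')

# best_checkpoints is a list of (Checkpoint, metrics) pairs,
# at most num_to_keep entries long.
best_ckpt, best_metrics = max(
    best_result.best_checkpoints,
    key=lambda pair: pair[1]['eval_seq_f1'],
)

# Materialize the checkpoint into a local directory to load the model from it.
local_dir = best_ckpt.to_directory()
print(local_dir, best_metrics['eval_seq_f1'])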