Tuner.fit().get_best_result has no checkpoints (None)

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

I am currently training a lightgbm model with a trainer as follows:

    trainer = LightGBMTrainer(
        scaling_config=ScalingConfig(
            # Number of workers to use for data parallelism.
            num_workers=1,
            # Whether to use GPU acceleration.
            use_gpu=False,
            resources_per_worker={"CPU": 8},
        ),
        label_column="y",
        num_boost_round=650,
        params={
            "learning_rate": 0.05,
            "objective": "multiclass",
            "num_class": 3,
            "metric": ["multi_error"],
        },
        datasets={"train": small_train_dataset, "valid": small_eval_dataset},
    )
    result = trainer.fit()

I am able to access the checkpoint from ‘result’. Now I want to do a hyperparameter search on the trainer. Here is what I wrote:

    search_space = {
        "params": {
            "learning_rate": tune.loguniform(0.01, 0.5),
            "max_depth": tune.randint(1, 30),
            "num_leaves": tune.randint(10, 200),
            "feature_fraction": tune.uniform(0.1, 1.0),
            "subsample": tune.uniform(0.1, 1.0),
        }
    }

    # Define the scheduler and search algorithm
    scheduler = ASHAScheduler(max_t=6, grace_period=1)
    tuner = Tuner(
        trainer,
        param_space=search_space,
        run_config=air.RunConfig(
            name="example-experiment",
            checkpoint_config=air.CheckpointConfig(checkpoint_frequency=1),
        ),
        tune_config=tune.TuneConfig(
            metric="valid-multi_error",
            mode="min",  # multi_error is an error rate, so lower is better
            num_samples=5,
            scheduler=scheduler
        )
    )

    result_grid = tuner.fit()
    best_result = result_grid.get_best_result(metric="valid-multi_error", mode="min")

When I try to access best_result.checkpoint, the value is None, and best_result.best_checkpoints is an empty list. I am not sure why I don't have a checkpoint, as the docs seemed pretty straightforward on how to access it.

I will hopefully get to the bottom of this, but for now you can specify RunConfig directly on the Trainer to work around it.
For example, the following code checkpoints at every iteration and at the end of each trial:

    # Imports assume the Ray 2.x AIR API used elsewhere in this thread.
    from ray import tune
    from ray.air.config import CheckpointConfig, RunConfig, ScalingConfig
    from ray.train.lightgbm import LightGBMTrainer
    from ray.tune import Tuner
    from ray.tune.schedulers import ASHAScheduler

    # prepare_data() stands in for your own data loading; it is not defined here.
    train_dataset, eval_dataset = prepare_data()

    trainer = LightGBMTrainer(
        scaling_config=ScalingConfig(
            # Number of workers to use for data parallelism.
            num_workers=1,
            # Whether to use GPU acceleration.
            use_gpu=False,
            resources_per_worker={"CPU": 2},
        ),
        label_column="target",
        num_boost_round=650,
        params={
            "learning_rate": 0.05,
            "objective": "multiclass",
            "num_class": 3,
            "metric": ["multi_error"],
        },
        datasets={
            "train": train_dataset,
            "valid": eval_dataset,
        },
        # Checkpointing is configured on the Trainer instead of the Tuner.
        run_config=RunConfig(
            name="example-experiment",
            checkpoint_config=CheckpointConfig(
                checkpoint_frequency=1,
                checkpoint_at_end=True,
            ),
        ),
    )

    # Define the scheduler and search space
    scheduler = ASHAScheduler(max_t=6, grace_period=1)

    search_space = {
        "params": {
            "learning_rate": tune.loguniform(0.01, 0.5),
            "max_depth": tune.randint(1, 30),
            "num_leaves": tune.randint(10, 200),
            "feature_fraction": tune.uniform(0.1, 1.0),
            "subsample": tune.uniform(0.1, 1.0),
        }
    }

    tuner = Tuner(
        trainer,
        param_space=search_space,
        tune_config=tune.TuneConfig(
            metric="valid-multi_error",
            # multi_error is an error rate, so lower is better.
            mode="min",
            num_samples=2,
            scheduler=scheduler,
        ),
    )

    result = tuner.fit()
    best_result = result.get_best_result(
        metric="valid-multi_error", mode="min",
    )

    print(best_result)
    print(best_result.checkpoint)
    print(best_result.best_checkpoints)
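
If the best result still reports no checkpoint, it may help to check every trial in the grid rather than only the best one. The sketch below is a hypothetical helper (`summarize_checkpoints` is not part of Ray). It assumes only that the grid supports `len()` and indexing and that each result exposes `metrics` and `checkpoint` attributes, as in the Ray 2.x results used in this thread. The stand-in objects are there so the snippet runs without a cluster.

```python
from types import SimpleNamespace


def summarize_checkpoints(result_grid):
    """Return one (index, iterations, has_checkpoint) tuple per trial.

    Hypothetical helper, not a Ray API. It only reads the `metrics`
    and `checkpoint` attributes of each result.
    """
    rows = []
    for i in range(len(result_grid)):
        result = result_grid[i]
        metrics = result.metrics or {}
        # "training_iteration" is the iteration counter Tune reports per trial.
        iterations = metrics.get("training_iteration")
        rows.append((i, iterations, result.checkpoint is not None))
    return rows


# Stand-in results for illustration; with Ray you would pass the object
# returned by tuner.fit() instead.
fake_grid = [
    SimpleNamespace(metrics={"training_iteration": 1}, checkpoint=None),
    SimpleNamespace(metrics={"training_iteration": 6}, checkpoint="ckpt"),
]
for row in summarize_checkpoints(fake_grid):
    print(row)
```

If some trials show a checkpoint and others do not, comparing their iteration counts can hint at whether early stopping is involved.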

@Akarsh_Bhagavath does @gjoliver's suggestion help?
cc: @Yard1 @xwjiang2010

The best result checkpoint was still None. I tried out a few things over the weekend and I think this is an issue with the ASHAScheduler. I tried using the Median Stopping Rule and it worked fine.
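
For context on the scheduler settings used above: `ASHAScheduler(max_t=6, grace_period=1)` caps every trial at 6 reported iterations, even though the trainer requests 650 boosting rounds. The sketch below lists the iteration counts at which successive halving would compare trials. It assumes the default `reduction_factor=4` and a plain geometric schedule; it is a simplification for illustration, not Ray's actual implementation, and `asha_rungs` is a made-up name.

```python
def asha_rungs(max_t, grace_period, reduction_factor=4):
    """Geometric rung milestones between grace_period and max_t.

    Simplified illustration of successive halving, not Ray's code.
    """
    rungs = []
    t = grace_period
    while t <= max_t:
        rungs.append(t)
        t *= reduction_factor
    return rungs


print(asha_rungs(max_t=6, grace_period=1))  # prints [1, 4]
```

Under these assumptions a trial can be stopped as early as its first iteration, so it is worth checking how many iterations each trial completed before concluding where the checkpoint went missing.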

I tried both the ASHAScheduler and the Median Stopping Rule, and the checkpoint was still None.