Tuner.fit().get_best_result has no checkpoints (None)

Akarsh_Bhagavath · March 10, 2023, 12:13am

How severe does this issue affect your experience of using Ray?

High: It blocks me to complete my task.

I am currently training a lightgbm model with a trainer as follows:

    trainer = LightGBMTrainer(
    scaling_config=ScalingConfig(
        # Number of workers to use for data parallelism.
        num_workers=1,
        # Whether to use GPU acceleration.
        use_gpu=False,
        resources_per_worker={"CPU": 8},
    ),
    label_column="y",
    num_boost_round=650,
    params={
        "learning_rate" : 0.05,
        "objective": "multiclass",
        "num_class" : 3,
        "metric": ["multi_error"],
    },
    datasets={"train": small_train_dataset, "valid": small_eval_dataset},
    )
    result = trainer.fit()

I am able to access the checkpoint from ‘result’. Now I want to do a hyperparameter search on the trainer. Here is what I wrote:

search_space = {“params” : {
“learning_rate”: tune.loguniform(0.01, 0.5),
“max_depth”: tune.randint(1, 30),
“num_leaves”: tune.randint(10, 200),
“feature_fraction”: tune.uniform(0.1, 1.0),
“subsample”: tune.uniform(0.1, 1.0)
}
}

    # Define the scheduler and search algorithm
    scheduler = ASHAScheduler(max_t=6, grace_period=1)
    tuner = Tuner(
        trainer,
        param_space=search_space,
        run_config=air.RunConfig(
            name="example-experiment",
            checkpoint_config=air.CheckpointConfig(checkpoint_frequency=1),
            ),
        tune_config=tune.TuneConfig(
            metric="valid-multi_error",
            mode="max",
            num_samples=5,
            scheduler=scheduler
        )
    )

result_grid = tuner.fit()
best_result = result_grid.get_best_result(metric=“valid-multi_error”, mode=“min”)

when I try to access best_result.checkpoints, the value is ‘None’, when I try to access best_result.best_checkpoints, that value is also an empty list , I am not sure why I don’t have a checkpoint as the docs seemed pretty straightforward on how to access it.

gjoliver · March 14, 2023, 5:16am

I will hopefully get to the bottom of this. but for now, you can specify RunConfig directly on the Trainer to work around this.
For example, the following code checkpoints at every iteration and at the end of the Trial:

train_dataset, eval_dataset = prepare_data()


trainer = LightGBMTrainer(
    scaling_config=ScalingConfig(
        # Number of workers to use for data parallelism.
        num_workers=1,
        # Whether to use GPU acceleration.
        use_gpu=False,
        resources_per_worker={"CPU": 2},
    ),
    label_column="target",
    num_boost_round=650,
    params={
        "learning_rate" : 0.05,
        "objective": "multiclass",
        "num_class" : 3,
        "metric": ["multi_error"],
    },
    datasets={
        "train": train_dataset,
        "valid": eval_dataset,
    },
    run_config=RunConfig(
        name="example-experiment",
        checkpoint_config=CheckpointConfig(
            checkpoint_frequency=1,
            checkpoint_at_end=True,
        ),
    ),
)


# Define the scheduler and search algorithm
scheduler = ASHAScheduler(max_t=6, grace_period=1)

search_space = {
    "params": {
        "learning_rate": tune.loguniform(0.01, 0.5),
        "max_depth": tune.randint(1, 30),
        "num_leaves": tune.randint(10, 200),
        "feature_fraction": tune.uniform(0.1, 1.0),
        "subsample": tune.uniform(0.1, 1.0)
    }
}

tuner = Tuner(
    trainer,
    param_space=search_space,
    tune_config=tune.TuneConfig(
        metric="valid-multi_error",
        mode="max",
        num_samples=2,
        scheduler=scheduler
    )
)


result = tuner.fit()
best_result = result.get_best_result(
    metric="valid-multi_error", mode="min",
)

print(best_result)
print(best_result.checkpoint)
print(best_result.best_checkpoints)

Jules_Damji · March 14, 2023, 4:30pm

@Akarsh_Bhagavath does @gjoliver suggestion help?
cc: @Yard1 @xwjiang2010

Akarsh_Bhagavath · March 14, 2023, 10:00pm

The best result checkpoint was still None. I tried out a few things over the weekend and I think this is an issue with the ASHAScheduler. I tried using the Median Stopping Rule and it worked fine.

Scott_Zhang · August 26, 2024, 3:37pm

i use both ASHAScheduler and Median Stopping Rule, checkpoint was still None

Topic		Replies	Views
Tuner not returning best checkpoint Ray Tune	3	382	April 25, 2023
RAY tune does not save checkpoint information under experiment path	0	108	April 7, 2024
AttributeError: '_TrackedCheckpoint' object has no attribute 'value' Ray Tune	2	625	January 10, 2023
Best model based on Checkpoint not Last epoch Ray Tune	10	1676	April 24, 2021
[Tune] How to turn off checkpointing for testing Ray Tune	20	3138	April 18, 2023

Tuner.fit().get_best_result has no checkpoints (None)

Related topics