Ray Tune metrics not consistent with offline evaluation

I use Ray Tune for LightGBM model training. The val and test results I get from the ResultGrid are different from an offline run with the same model iteration and dataset. I wonder how Ray Tune does the evaluation. Does it use the full val set or a shard? Thanks!

Can you paste your script? What do you mean by offline evaluation in this context?

What I meant by offline evaluation is loading the model and applying it to the val and test sets.

For example:
from ray.train.batch_predictor import BatchPredictor
from ray.train.lightgbm import LightGBMCheckpoint, LightGBMPredictor

ckpt = LightGBMCheckpoint.from_uri(ckpt_dir)
batch_predictor = BatchPredictor.from_checkpoint(ckpt, LightGBMPredictor)
prob = batch_predictor.predict(
    dataset, feature_columns=feat_cols, keep_columns=['event_label'], num_iteration=501
)
avg_prec = average_precision(y_true, prob)

This is different from what I get from the progress.csv in the same checkpoint directory:

progress = pd.read_csv(f"{trial_dir}/progress.csv")

where I check the corresponding performance at the same num_iteration, 501.

The progress performance is aligned with the result grid from the tuner.
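For reference, this is roughly how I pull the number out of progress.csv to compare it with the offline result (a sketch; the metric column name "val-average_precision" is just a placeholder for whatever the eval metric is actually reported as):

import pandas as pd

# Sketch of the comparison; "val-average_precision" is a placeholder column name.
progress = pd.read_csv(f"{trial_dir}/progress.csv")
row = progress[progress["training_iteration"] == 501]
print(row["val-average_precision"].iloc[0])  # reported by Ray Tune
print(avg_prec)                              # computed offline with BatchPredictor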

How is evaluation done in the training loop? How are evaluation metrics reported in the training loop?

I am using:

from ray.train.lightgbm import LightGBMTrainer

trainer = LightGBMTrainer(
        datasets={"train": train_dataset, "val": val_dataset, "test": test_dataset},
        label_column=cfg.data.label_col,
        params={},
        dmatrix_params={
                'train': {'weight': cfg.data.weight_col},
                'val': {'weight': cfg.data.weight_col},
                'test': {'weight': cfg.data.weight_col}
                },
        eval_metric=eval_func,
        scaling_config=ScalingConfig(
            num_workers=cfg.task.num_workers,
            resources_per_worker={"CPU": cfg.task.num_cpus_per_workers},
            trainer_resources={"CPU": 0},
            use_gpu=False,
            placement_strategy="SPREAD",
            _max_cpu_fraction_per_node=0.8,
        ),
    )
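eval_func is my custom metric (not shown here). As a rough, hypothetical sketch, assuming the sklearn-style custom eval_metric convention of (y_true, y_pred) -> (name, value, is_higher_better), it looks something like:

from sklearn.metrics import average_precision_score

# Hypothetical sketch of eval_func -- assumes the sklearn-style custom
# eval_metric convention: (y_true, y_pred) -> (name, value, is_higher_better).
def eval_func(y_true, y_pred):
    return "average_precision", average_precision_score(y_true, y_pred), True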

The training loop is defined in /lib/python3.7/site-packages/ray/train/gbdt_trainer.py:

def training_loop(self) -> None:
        config = self.train_kwargs.copy()

        dmatrices = self._get_dmatrices(
            dmatrix_params=self.dmatrix_params,
        )
        train_dmatrix = dmatrices[TRAIN_DATASET_KEY]
        evals_result = {}

        init_model = None
        if self.resume_from_checkpoint:
            init_model, _ = self._load_checkpoint(self.resume_from_checkpoint)

        config.setdefault("verbose_eval", False)
        config.setdefault("callbacks", [])

        if not any(
            isinstance(
                cb, (self._tune_callback_report_cls, self._tune_callback_checkpoint_cls)
            )
            for cb in config["callbacks"]
        ):
            # Only add our own callback if it hasn't been added before
            checkpoint_frequency = (
                self.run_config.checkpoint_config.checkpoint_frequency
            )
            if checkpoint_frequency > 0:
                callback = self._tune_callback_checkpoint_cls(
                    filename=MODEL_KEY, frequency=checkpoint_frequency
                )
            else:
                callback = self._tune_callback_report_cls()

            config["callbacks"] += [callback]

        config[self._init_model_arg_name] = init_model

        model = self._train(
            params=self.params,
            dtrain=train_dmatrix,
            evals_result=evals_result,
            evals=[(dmatrix, k) for k, dmatrix in dmatrices.items()],
            ray_params=self._ray_params,
            **config,
        )

        checkpoint_at_end = self.run_config.checkpoint_config.checkpoint_at_end
        if checkpoint_at_end is None:
            checkpoint_at_end = True

        if checkpoint_at_end:
            self._checkpoint_at_end(model, evals_result)

Thanks for the context.
lightgbm-ray uses Ray actors for distributed training; each actor runs a LightGBM model.fit() (for that part you can refer to the LightGBM documentation). Roughly speaking, each actor works on its own shard of the evaluation dataset (local evaluation), so every actor will have different evaluation results. When the results are combined, the logic treats all workers' evaluation results the same instead of computing some mathematical average. I think this is why the result is not the same as what you would get by running .predict() with a checkpoint from a given booster round.
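To see why per-shard (local) evaluation cannot simply be recombined into the full-set number, here is a standalone illustration (plain Python, not Ray code): a ranking metric such as average precision computed per shard and then averaged is generally not the same as the metric computed over the whole validation set.

import numpy as np
from sklearn.metrics import average_precision_score

# Standalone illustration: average precision is not decomposable across shards,
# so per-shard results recombined naively differ from the full-set metric.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=10_000)
y_score = 0.5 * rng.random(10_000) + 0.5 * y_true * rng.random(10_000)

full = average_precision_score(y_true, y_score)
per_shard = [
    average_precision_score(y_true[s], y_score[s])
    for s in np.array_split(np.arange(10_000), 4)
]
print(full)                 # metric on the full validation set
print(np.mean(per_shard))   # mean of per-shard metrics -- generally different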

cc @Yard1 to confirm my understanding.

Thanks so much!
Is each actor's evaluation result different? If they are different, which one is used? Is there a way I can evaluate the model on the whole validation set in Ray Tune (not a shard of it)?

We only report the results from the rank 0 worker, but each actor should have the same evaluation set - we should only be sharding the training data, not the evaluation data. If we are not doing that, this is a bug - I will need to double-check the logic in LightGBMTrainer.
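In the meantime, one way to cross-check is to load the booster from the trial checkpoint, evaluate it on the complete validation set yourself, and compare with the metric in the ResultGrid. A rough sketch (the metric key "val-average_precision" and the result_grid variable are placeholders for your actual names):

from sklearn.metrics import average_precision_score
from ray.train.lightgbm import LightGBMCheckpoint

# Rough cross-check sketch; "val-average_precision" is a placeholder metric key.
ckpt = LightGBMCheckpoint.from_uri(ckpt_dir)
booster = ckpt.get_model()  # lightgbm.Booster
val_df = val_dataset.to_pandas()
prob = booster.predict(val_df[feat_cols], num_iteration=501)
full_ap = average_precision_score(val_df[cfg.data.label_col], prob)

best_result = result_grid.get_best_result(metric="val-average_precision", mode="max")
print(full_ap, best_result.metrics["val-average_precision"])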

I am not sure whether the evaluation is done before or after the gradients are synced. If it is the latter and the data is the same on all workers, then all workers should return the same metrics (I think that is the case).