Tuning XGBoost with PBT

I’m trying to tune an XGBoost model with the PBT scheduler in Ray. For the most part, it appears to be working well, though I have some questions:

xgb.train(
    config,
    train_set,
    evals=[(test_set, "eval")],
    verbose_eval=False,
    num_boost_round=n_estimators,
    callbacks=[TuneReportCheckpointCallback(filename="model.xgb")])

It appears from the implementation of TuneReportCheckpointCallback() that this does the whole job, including saving ‘step’ information. Restoring also seems to work “automagically” like a charm - if one has remembered to set ‘num_boost_round’. This could be stated more clearly in the documentation - if I have understood it correctly.

  1. Have I understood correctly?

However, I still have a few issues:

  1. As the perturbation multiplies by 1.2 and the resulting new_config is not checked against the limits given in hyperparam_mutations, it is possible for a given hyperparameter to exceed its bounds, resulting in:

    (pid=1445879) xgboost.core.XGBoostError: value 1.11478 for Parameter subsample exceed bound [0,1]
    (pid=1445879) subsample: Row subsample ratio of training instance.
    

    Is there a way to avoid that other than excluding the parameter from the mutations?

  2. From time to time, I get

    2021-04-10 21:43:51,982	WARNING worker.py:1107 -- A worker died or was killed while executing task ffffffffffffffff76e3e21ca358acff9fa4f6a601000000.
    Result for train_af2db_00015:
    

    Why? Can I avoid that too?

Any insight is much appreciated. I have enclosed the whole code below - there might be other issues as well:

import sklearn.datasets
import sklearn.metrics
import numpy as np
import xgboost as xgb
import os
import random

from sklearn import preprocessing
from sklearn.model_selection import train_test_split

from ray import tune
from ray.tune.schedulers import PopulationBasedTraining
from ray.tune.integration.xgboost import TuneReportCheckpointCallback


n_estimators = 500

# Load dataset
data = np.genfromtxt("trial.csv", usecols=range(0,8), delimiter=",",
                     names=['truecl','cep50', 'fiel', 'v0', 'acc', 'rcs'],
                     dtype=('S2', int, float, float, int, float))

le = preprocessing.LabelEncoder()
y = le.fit_transform(data['truecl'])
X = np.array([list(r)[2:] for r in data])

def train(config, checkpoint_dir=None):
    
    # Split into train and test set
    train_x, test_x, train_y, test_y = train_test_split(
        X, y, test_size=0.25)

    # Build input matrices for XGBoost
    train_set = xgb.DMatrix(train_x, label=train_y)
    test_set = xgb.DMatrix(test_x, label=test_y)

    # Train the classifier
    xgb.train(
        config,
        train_set,
        evals=[(test_set, "eval")],
        verbose_eval=False,
        num_boost_round=n_estimators,
        callbacks=[TuneReportCheckpointCallback(filename="model.xgb")])
        
    # Return prediction accuracy
#    accuracy = 1. - results["eval"]["merror"][-1]
#    tune.report(mean_accuracy=accuracy, done=True)
    
    
if __name__ == "__main__":
    config = {
        "objective": "multi:softmax",
        "num_class": 6,
        "eval_metric": ["mlogloss", "merror"],
        "max_depth": tune.randint(1, 9),
        "min_child_weight": tune.choice([1, 2, 3]),
        "gamma": tune.uniform(0.5, 5.0),
        "subsample": tune.uniform(0.5, 1.0),
        "colsample_bytree": tune.uniform(0.4, 1.0),
        "eta": tune.loguniform(1e-4, 1e-1),
        "learning_rate": tune.choice([1e-3, 1e-4, 1e-5]),
        "lambda": tune.uniform(0.1, 5.0),
        "alpha": tune.uniform(0.1, 5.0)
    }
    # Population Based Training: exploit the best trials and perturb their hyperparameters.
    scheduler = PopulationBasedTraining(
        time_attr='time_total_s',
#        metric='mean_accuracy',
        metric='eval-merror',
        mode='min',
        perturbation_interval=10,
        hyperparam_mutations={
            "max_depth": lambda: random.randint(1, 9),
            "min_child_weight": [1, 2, 3],
            "gamma": lambda: random.uniform(0.5, 5.0),
            "subsample": lambda: random.uniform(0.5, 1.0),
            "colsample_bytree": lambda: random.uniform(0.4, 1.0),
            "eta": lambda: random.uniform(1e-4, 1e-1),
            "learning_rate": [1e-3, 1e-4, 1e-5],
            "lambda": lambda: random.uniform(0.1, 5.0),
            "alpha": lambda: random.uniform(0.1, 5.0)
        })
    
    analysis = tune.run(
        train,
        scheduler=scheduler,
        resources_per_trial={"cpu": 1},
        config=config,
        num_samples=25)


    # Get the best trial based on minimum eval-merror across all training iterations.
    best_trial = analysis.get_best_trial(metric="eval-merror", mode="min", scope="all") 
    # Get the best checkpoint for that trial, also based on eval-merror.
    best_checkpoint = analysis.get_best_checkpoint(best_trial,
                                                   metric='eval-merror',
                                                   mode='min')

    # Load the best model checkpoint
    best_bst = xgb.Booster()
    best_bst.load_model(os.path.join(best_checkpoint, "model.xgb"))
    accuracy = 1. - best_trial.last_result["eval-merror"]
    print(f"Best model parameters: {best_trial.config}")
    print(f"Best model total accuracy: {accuracy:.4f}")

Update to 1):

I’m not sure if this is a bug or a feature, but it appears that ‘num_boost_round’ is not always honoured. It is not used as a limit on the total number of iterations, but on the iterations since the last restore, which suggests that the number of steps is not always recorded/restored correctly - unless this is by design:

When OK with ‘num_boost_round’ = 2000:

+-------------------+------------+-------+------------+-------------+--------------------+--------+------------------+-----------------+---------------+
| Trial name        | status     | loc   |        eta |   max_depth |   min_child_weight |   iter |   total time (s) |   eval-mlogloss |   eval-merror |
|-------------------+------------+-------+------------+-------------+--------------------+--------+------------------+-----------------+---------------|
| train_acd4d_00000 | TERMINATED |       | 0.0527699  |           1 |                  2 |   2000 |          48.1875 |        0.102736 |      0.026549 |
| train_acd4d_00001 | TERMINATED |       | 0.0430121  |           1 |                  2 |   2000 |          47.9619 |        0.090751 |      0.028319 |
| train_acd4d_00002 | TERMINATED |       | 0.00659019 |           6 |                  2 |   2000 |          61.2121 |        0.087132 |      0.023009 |
| train_acd4d_00003 | TERMINATED |       | 0.0438558  |           7 |                  1 |   2000 |          52.4252 |        0.109443 |      0.028319 |
+-------------------+------------+-------+------------+-------------+--------------------+--------+------------------+-----------------+---------------+

When not OK with ‘num_boost_round’ = 2000:

+-------------------+------------+-------+-----------+-------------+--------------------+--------+------------------+-----------------+---------------+
| Trial name        | status     | loc   |       eta |   max_depth |   min_child_weight |   iter |   total time (s) |   eval-mlogloss |   eval-merror |
|-------------------+------------+-------+-----------+-------------+--------------------+--------+------------------+-----------------+---------------|
| train_7191d_00000 | TERMINATED |       | 0.0912372 |           2 |                  2 |   2000 |          42.2566 |        0.088485 |      0.021239 |
| train_7191d_00001 | TERMINATED |       | 0.052294  |           8 |                  1 |   2000 |          44.7836 |        0.044939 |      0.014159 |
| train_7191d_00002 | TERMINATED |       | 0.0627527 |           9 |                  1 |   2884 |          56.0466 |        0.063437 |      0.021239 |
| train_7191d_00003 | TERMINATED |       | 0.0627527 |           7 |                  1 |   3786 |          64.832  |        0.087652 |      0.019469 |
+-------------------+------------+-------+-----------+-------------+--------------------+--------+------------------+-----------------+---------------+

Note trial 00003 in particular.

This appears to be happening either because of a transfer:

Result for train_7191d_00001:
  date: 2021-04-11_20-20-06
  done: false
  eval-merror: 0.014159
  eval-mlogloss: 0.044831
  experiment_id: 64be541b15944bc7b4a756b933af81ad
  hostname: carbon
  iterations_since_restore: 1786
  node_ip: 192.168.1.115
  pid: 1555982
  should_checkpoint: true
  time_since_restore: 40.213419675827026
  time_this_iter_s: 0.5832500457763672
  time_total_s: 40.213419675827026
  timestamp: 1618165206
  timesteps_since_restore: 0
  training_iteration: 1786
  trial_id: 7191d_00001
  
2021-04-11 20:20:06,198	INFO pbt.py:532 -- [exploit] transferring weights from trial train_7191d_00001 (score -0.014159) -> train_7191d_00003 (score -0.035398)
2021-04-11 20:20:06,198	INFO pbt.py:549 -- [explore] perturbed config from {'max_depth': 8, 'min_child_weight': 1, 'eta': 0.05229395217030108} -> {'max_depth': 7, 'min_child_weight': 1, 'eta': 0.0627527426043613}
Result for train_7191d_00003:
  date: 2021-04-11_20-20-06
  done: false
  eval-merror: 0.035398
  eval-mlogloss: 0.141105
  experiment_id: 6d0e48a3dbf5438fa5c96e5aa673b271
  hostname: carbon
  iterations_since_restore: 1966
  node_ip: 192.168.1.115
  pid: 1555985
  should_checkpoint: true
  time_since_restore: 40.15677785873413
  time_this_iter_s: 0.1815929412841797
  time_total_s: 40.15677785873413
  timestamp: 1618165206
  timesteps_since_restore: 0
  training_iteration: 1966
  trial_id: 7191d_00003

…or a restore:

(pid=1555985) 2021-04-11 20:20:06,240|INFO trainable.py:371 -- Restored on 192.168.1.115 from checkpoint: /home/ft/ray_results/train_2021-04-11_20-19-24/train_7191d_00003_3_eta=0.045646,max_depth=7,min_child_weight=3_2021-04-11_20-19-24/checkpoint_tmp491d25/./
(pid=1555985) 2021-04-11 20:20:06,240|INFO trainable.py:379 -- Current state after restoring: {'_iteration': 1786, '_timesteps_total': None, '_time_total': 40.213419675827026, '_episodes_total': None}

+-------------------+------------+-----------------------+-----------+-------------+--------------------+--------+------------------+-----------------+---------------+
| Trial name        | status     | loc                   |       eta |   max_depth |   min_child_weight |   iter |   total time (s) |   eval-mlogloss |   eval-merror |
|-------------------+------------+-----------------------+-----------+-------------+--------------------+--------+------------------+-----------------+---------------|
| train_7191d_00001 | RUNNING    | 192.168.1.115:1555982 | 0.052294  |           8 |                  1 |   1943 |          43.6421 |        0.044885 |      0.014159 |
| train_7191d_00002 | RUNNING    | 192.168.1.115:1555983 | 0.0627527 |           9 |                  1 |   1948 |          43.5346 |        0.063725 |      0.021239 |
| train_7191d_00003 | RUNNING    | 192.168.1.115:1555985 | 0.0627527 |           7 |                  1 |   1920 |          43.4511 |        0.081179 |      0.019469 |
| train_7191d_00000 | TERMINATED |                       | 0.0912372 |           2 |                  2 |   2000 |          42.2566 |        0.088485 |      0.021239 |
+-------------------+------------+-----------------------+-----------+-------------+--------------------+--------+------------------+-----------------+---------------+

Result for train_7191d_00003:
  date: 2021-04-11_20-20-11
  done: false
  eval-merror: 0.019469
  eval-mlogloss: 0.079611
  experiment_id: 64be541b15944bc7b4a756b933af81ad
  hostname: carbon
  iterations_since_restore: 235
  node_ip: 192.168.1.115
  pid: 1555985
  time_since_restore: 4.9278483390808105
  time_this_iter_s: 0.008939027786254883
  time_total_s: 45.14126801490784
  timestamp: 1618165211
  timesteps_since_restore: 0
  training_iteration: 2021
  trial_id: 7191d_00003

:

Result for train_7191d_00003:
  date: 2021-04-11_20-20-30
  done: true
  eval-merror: 0.019469
  eval-mlogloss: 0.087652
  experiment_id: 64be541b15944bc7b4a756b933af81ad
  experiment_tag: 3_eta=0.045646,max_depth=7,min_child_weight=3@perturbed[eta=0.062753,max_depth=7,min_child_weight=1]
  hostname: carbon
  iterations_since_restore: 2000
  node_ip: 192.168.1.115
  pid: 1555985
  time_since_restore: 24.618607759475708
  time_this_iter_s: 0.005644798278808594
  time_total_s: 64.83202743530273
  timestamp: 1618165230
  timesteps_since_restore: 0
  training_iteration: 3786
  trial_id: 7191d_00003

This means that, depending on how the run is actually progressing and on how the transfer/restore happens during the schedule, a trial could keep running indefinitely. It resembles a data race…

Hi Frode,

I’ll try to shed some light on all of your questions.

  1. The TuneReportCheckpointCallback does take care of checkpointing, but it does not automatically take care of restoring! I just saw that we don’t provide a good example for this in the docs yet - I’ll make sure we update this at some point (tracked here: [tune] Make sure we have checkpoint/restore examples in the docs (e.g. for xgboost) · Issue #15244 · ray-project/ray · GitHub).

This means you’ll have to specify to restore from a checkpoint specifically:

def train(config, checkpoint_dir=None):
    # train_set and test_set are built as in your original script
    model_file = None
    if checkpoint_dir:
        # Resume from the model checkpointed by TuneReportCheckpointCallback
        model_file = os.path.join(checkpoint_dir, "model.xgb")
    xgb.train(
        config,
        train_set,
        evals=[(test_set, "eval")],
        verbose_eval=False,
        num_boost_round=n_estimators,
        xgb_model=model_file,
        callbacks=[TuneReportCheckpointCallback(filename="model.xgb")])

However, as you noted in the update, this does not adjust the number of boosting rounds.
For the update: you’re seeing these results because the trial had already trained for a number of iterations (say 1786) before it exploits another trial. It then runs for another 2000 boosting rounds, leading to a total of 3786 iterations. Note, though, that the actual number of trained boosting rounds might still be different: if trial 1 trained for 1500 iterations and trial 3 for 1786 iterations before it exploits trial 1, trial 3 will have trained for 1500 + 2000 = 3500 boosting rounds at the end.

Usually this problem is avoided by including the number of boosting rounds in the checkpoint. In xgboost you could try something like this:

bst = xgb.Booster(model_file=model_file)   # load the checkpointed model
boosted_rounds = bst.num_boosted_rounds()  # requires xgboost >= 1.4
rounds_left = 2000 - boosted_rounds
xgb.train(
    # ...,
    num_boost_round=rounds_left)
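
Putting the two pieces together, a rough sketch of the trainable could look like this (untested, assuming xgboost >= 1.4 for num_boosted_rounds() and reusing the names from your script):

def train(config, checkpoint_dir=None):
    # Build the DMatrix objects as in the original script
    train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.25)
    train_set = xgb.DMatrix(train_x, label=train_y)
    test_set = xgb.DMatrix(test_x, label=test_y)

    model_file = None
    rounds_left = n_estimators
    if checkpoint_dir:
        # Resume from the checkpointed model and only train the missing rounds
        model_file = os.path.join(checkpoint_dir, "model.xgb")
        bst = xgb.Booster(model_file=model_file)
        rounds_left = n_estimators - bst.num_boosted_rounds()

    xgb.train(
        config,
        train_set,
        evals=[(test_set, "eval")],
        verbose_eval=False,
        num_boost_round=rounds_left,
        xgb_model=model_file,
        callbacks=[TuneReportCheckpointCallback(filename="model.xgb")])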

  2. You can pass a custom_explore_fn to the PopulationBasedTraining scheduler (see Trial Schedulers (tune.schedulers) — Ray v2.0.0.dev0). This is a function that takes the generated config as input, and you can modify its elements. Here you can e.g. limit the subsample parameter to 1:

def custom_explore_fn(config):
    # Clamp subsample back into its valid range after a perturbation
    config["subsample"] = min(1., config["subsample"])
    return config

scheduler = PopulationBasedTraining(
    # ...,
    custom_explore_fn=custom_explore_fn)
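
If you want to keep all perturbed values inside the ranges of your search space (not only subsample), a more general clamp along these lines should also work - the bounds dictionary below is just an illustration mirroring your hyperparam_mutations:

BOUNDS = {
    "subsample": (0.5, 1.0),
    "colsample_bytree": (0.4, 1.0),
    "eta": (1e-4, 1e-1),
    "gamma": (0.5, 5.0),
    "lambda": (0.1, 5.0),
    "alpha": (0.1, 5.0),
}

def custom_explore_fn(config):
    # Clamp every bounded hyperparameter back into its original range
    for key, (low, high) in BOUNDS.items():
        if key in config:
            config[key] = min(high, max(low, config[key]))
    return config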

  3. It’s hard to say what is happening there. Do you see any other logs? Are you running on a preemptible cluster? If it dies silently, this might indicate an OOM error on the xgboost side.

Hello Kai,

Thank you so much for your reply. That was really helpful and it now actually works (almost) as expected. Let me answer your replies one by one.

  1. Initially I did try the checkpoint_dir check (as noted in other examples), but it was never called in my original tests. That’s why I thought everything was hidden inside the scheduler - that, and because the results were so good. :slight_smile: It’s good that the documentation will be updated, as it is a bit opaque.

  2. It’s so obvious when things are spelled out to you. Thank you.

  3. I freed up some memory and that particular warning is now gone.

However, other issues have appeared. Let me know if you want them somewhere else:

  1. From time to time I get these warnings:

    2021-04-13 18:06:16,842	WARNING trial_runner.py:420 -- Trial Runner checkpointing failed: Checkpoint must not be in-memory.
    

    and:

    2021-04-13 20:35:28,915	INFO pbt.py:481 -- [pbt]: no checkpoint for trial. Skip exploit for Trial train_cc285_00005
    

    Are they relevant to the result of the tuning?

  2. With synch=True (and, it appears, time_attr=‘time_total_s’), I get:

    Traceback (most recent call last):
      File "/home/ft/HIOF/master/sk_xgb1_pbt.py", line 97, in <module>
        analysis = tune.run(
      File "/usr/local/lib/python3.8/dist-packages/ray/tune/tune.py", line 421, in run
        runner.step()
      File "/usr/local/lib/python3.8/dist-packages/ray/tune/trial_runner.py", line 404, in step
        self.trial_executor.on_no_available_trials(self)
      File "/usr/local/lib/python3.8/dist-packages/ray/tune/trial_executor.py", line 192, in on_no_available_trials
        raise TuneError("There are paused trials, but no more pending "
    ray.tune.error.TuneError: There are paused trials, but no more pending trials with sufficient resources.
    

But there are workers available:

+-------------------+------------+-----------------------+------------+-------------+--------------------+--------+------------------+-----------------+---------------+
| Trial name        | status     | loc                   |        eta |   max_depth |   min_child_weight |   iter |   total time (s) |   eval-mlogloss |   eval-merror |
|-------------------+------------+-----------------------+------------+-------------+--------------------+--------+------------------+-----------------+---------------|
| train_1b029_00006 | RUNNING    | 192.168.1.115:1750736 | 0.027682   |           3 |                  1 |     41 |          44.3129 |        0.574313 |      0.044248 |
| train_1b029_00007 | RUNNING    | 192.168.1.115:1750734 | 0.0704629  |           3 |                  2 |     40 |          44.3141 |        0.21976  |      0.042478 |
| train_1b029_00008 | RUNNING    | 192.168.1.115:1750873 | 0.0730381  |           1 |                  2 |     42 |          19.6895 |        0.396751 |      0.077876 |
| train_1b029_00009 | RUNNING    | 192.168.1.115:1750906 | 0.00214736 |           5 |                  2 |      4 |           6.7107 |        1.76858  |      0.031858 |
| train_1b029_00000 | PAUSED     |                       | 0.0190437  |           8 |                  1 |     22 |          46.2494 |        1.04767  |      0.033628 |
| train_1b029_00001 | PAUSED     |                       | 0.074065   |           5 |                  1 |     26 |          45.8214 |        0.249421 |      0.015929 |
| train_1b029_00002 | PAUSED     |                       | 0.0489521  |           8 |                  2 |     22 |          45.2608 |        0.520015 |      0.015929 |
| train_1b029_00003 | PAUSED     |                       | 0.0650885  |           4 |                  3 |     32 |          46.2763 |        0.244538 |      0.023009 |
| train_1b029_00004 | TERMINATED |                       | 0.0555072  |           1 |                  1 |     50 |          22.4492 |        0.473362 |      0.157522 |
| train_1b029_00005 | TERMINATED |                       | 0.0640835  |           2 |                  1 |     50 |          34.503  |        0.23221  |      0.060177 |
+-------------------+------------+-----------------------+------------+-------------+--------------------+--------+------------------+-----------------+---------------+

  1. The number of iterations does not always exactly match the specified ‘num_boost_round’ (I assume they count the same thing) at the end of the tuning. This is probably not a big deal in real-world applications, but in many research settings it is not desired behaviour, e.g. when comparing results from different methods. Is there a way to make this exact? I also checked with synch=True.

  2. Upgrading xgboost to 1.4.0 (for access to ‘num_boosted_rounds()’) resulted in an increase in training time: from less than one second per 100 iterations to more than 150 seconds with 1.3.0rc1 and newer. This is only apparent with Ray as far as I can tell. Am I doing something wrong? Calling ‘train’ vs. calling ‘XGBClassifier’, maybe?

Thanks again!

Update on 1):

It seems that num_boosted_rounds() does not always return multiples of perturbation_interval.

Update II on 1):

synch=True in the scheduler and stop={"training_iteration": n_estimators} in the tune.run() call seem to do what I want/need consistently. Something is still off, though: from num_boosted_rounds() I got 21, 42 and 63, while I expected 25, 50 and 75 - 4 off for each interval.
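
For reference, the combination I mean looks roughly like this (just a sketch - perturbation_interval=25 and time_attr="training_iteration" are illustrative values matching the 25/50/75 expectation; the rest is as in the script above):

scheduler = PopulationBasedTraining(
    time_attr="training_iteration",
    metric="eval-merror",
    mode="min",
    perturbation_interval=25,
    synch=True,
    hyperparam_mutations={
        # same mutations as in the script above, e.g.:
        "eta": lambda: random.uniform(1e-4, 1e-1),
        "subsample": lambda: random.uniform(0.5, 1.0),
        # ...
    })

analysis = tune.run(
    train,
    scheduler=scheduler,
    resources_per_trial={"cpu": 1},
    config=config,
    stop={"training_iteration": n_estimators},
    num_samples=25)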

Before I found out the above, I did a complete “walk-through” of a run, with comments and formatting, but the forum complained it was too large. :smiley: I have included it in a Google Doc below, since Pastebin also complained. Feel free to peruse - or not :smiley: :

Hi Frode, the number of iterations can be different from the number of boosting rounds - especially if you don’t restore the current iteration count from a checkpoint. The default callback just reports after each iteration, and the iteration count is increased by 1 each time. Thus you can end up with more iterations than boosting rounds if the trial was restored often.

Thanks again @kai! I’m not sure I follow you now. I’m using synch=True, and I therefore thought that it was synchronised at both ends. Since all checkpoints happen at the same epoch time, the restoration also happens at the same epoch time. Or am I missing something?

Ah - yes, that seems right. However, there’s one thing: by default the TuneReportCheckpointCallback reports results every iteration, but only saves a checkpoint every 5 iterations (you can set this with the frequency parameter). This is how a mismatch can come up.

In your case the following seems to be happening. XGBoost counts its epochs starting at 0, but Tune counts steps starting at 1. At iteration 5 (Tune), XGBoost is at epoch 4, so no checkpoint is saved (as 4 % 5 != 0). Only at iteration 6 (Tune) will the checkpoint be saved - so after 6 boosting rounds (even though xgboost reports the epoch as 5).

So at step 25, the latest available checkpoint is from iteration 21. If you then exploit a trial, it starts with 21 boosted trees. It then trains for another 25 iterations - but in XGBoost terms this again means 21 boosting rounds until the next usable checkpoint. So you end up with multiples of 21 instead of 25.
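
To make the counting concrete, here is the arithmetic spelled out as a tiny script (following the epoch % frequency == 0 rule described above; purely illustrative):

frequency = 5              # default checkpoint frequency of the callback
perturbation_interval = 25

# Tune iteration i corresponds to xgboost epoch i - 1, and a checkpoint is
# written when (i - 1) % frequency == 0.
checkpoint_iterations = [i for i in range(1, perturbation_interval + 1)
                         if (i - 1) % frequency == 0]
print(checkpoint_iterations)      # [1, 6, 11, 16, 21]
print(checkpoint_iterations[-1])  # 21 -> what an exploiting trial starts from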

The solution for you here is to use TuneReportCheckpointCallback(frequency=1).
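
In the train function, that just means passing frequency=1 when creating the callback (same call as in the earlier sketch):

xgb.train(
    config,
    train_set,
    evals=[(test_set, "eval")],
    verbose_eval=False,
    num_boost_round=rounds_left,
    xgb_model=model_file,
    callbacks=[TuneReportCheckpointCallback(filename="model.xgb", frequency=1)])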

On our end, I’ll think about whether we should set the default checkpoint frequency to 1 (which would write checkpoints very often) or adjust XGBoost’s epoch counting to coincide with Tune’s step counting (i.e. use epoch + 1 instead of epoch).