Checkpointing using the Trainable Class API and XGBoost

Hello,

I am attempting to use Ray Tune with XGBoost via the Trainable class API and am having some trouble with how to implement checkpointing.

Here is the code I have so far for the Trainable class.

import os

import numpy as np
import xgboost
from ray.tune import Trainable
from sklearn.utils import shuffle


class ANTrainable(Trainable):

    def setup(self, config, data_obj_id=None):
        df = data_obj_id
        length = len(df)
        len_test = int(0.1 * length)
        len_train = length - len_test
        
        df_train = df.head(len_train)
        df_train = shuffle(df_train)
        y_train = [1 if val > 0 else 0 for val in df_train['counts'].values]
        y_train = np.array(y_train)
        
        df_train = df_train.drop(['datetime', 'counts'], axis=1)
        X_train = df_train.values
                            
        df_test = df.tail(len_test)
        y_test = [1 if val > 0 else 0 for val in df_test['counts'].values]
        y_test = np.array(y_test)
        self.y_test = y_test

        df_test = df_test.drop(['datetime', 'counts'], axis=1)
        X_test = df_test.values
        self.X_test = X_test
        self.config = config
        self.train_set = xgboost.DMatrix(X_train, y_train)
        self.test_set = xgboost.DMatrix(X_test, y_test)
        self.model = xgboost.Booster()  # placeholder; replaced in step()/load_checkpoint()
   
    def reset_config(self, new_config):
        self.config = new_config
        return True

    def step(self):
        evals_result = {}
        # Each call to step() trains a fresh booster from scratch
        # (num_boost_round defaults to 10) and records the eval metrics.
        bst = xgboost.train(
            self.config,
            self.train_set,
            evals=[(self.test_set, "eval")],
            evals_result=evals_result,
            verbose_eval=False,
        )

        self.model = bst

        return {
            'aucpr': evals_result['eval']['aucpr'][-1],
            'auc': evals_result['eval']['auc'][-1],
            'logloss': evals_result['eval']['logloss'][-1],
            'error': evals_result['eval']['error'][-1],
        }

    def save_checkpoint(self, tmp_checkpoint_dir):
        # Tune snapshots the returned directory as the trial's checkpoint.
        checkpoint_path = os.path.join(tmp_checkpoint_dir, "model.xgb")
        self.model.save_model(checkpoint_path)
        return tmp_checkpoint_dir

    def load_checkpoint(self, tmp_checkpoint_dir):
        checkpoint_path = os.path.join(tmp_checkpoint_dir, "model.xgb")
        bst = xgboost.Booster()
        # load_model() returns None, so load in place and keep the Booster.
        bst.load_model(checkpoint_path)
        self.model = bst

And here is the code for tune.run and the scheduler:

        # Requires: from ray import tune; from ray.tune.schedulers import ASHAScheduler
        self.config = {
            "tree_method": "hist",
            "objective": "binary:logistic",
            "eval_metric": ["aucpr", "auc", "logloss", "error"],
            "eta": tune.loguniform(1e-4, 1),
            "subsample": tune.uniform(0.1, 1.0),
            "colsample_bytree": tune.uniform(0.1, 1.0),
            "max_depth": tune.randint(3, 10),
            "gamma": tune.loguniform(0.01, 1),
            "min_child_weight": tune.uniform(1, 7),
        }
        
        print('Running Tune Step')
        
        self.analysis = tune.run(
            tune.with_parameters(ANTrainable, data_obj_id=self.an_model_input_data),
            reuse_actors=True,
            metric="logloss",
            mode="min",
            config=self.config,
            num_samples=100,
            max_failures=10,
            checkpoint_freq=10,
            scheduler=ASHAScheduler(
                max_t=10,  # training iterations
                grace_period=1,
                reduction_factor=2
            )
        )

I feel like there should be a way to take advantage of the Tune XGBoost callbacks, but I’m not entirely sure how to go about it.

Any help would be much appreciated, and if anyone has an example that would be even better!!

Hi @nikhil, the Tune XGBoost callbacks actually only work with Tune’s functional API, and I would recommend using that over the class API. It’ll require only some minor refactoring of your code: instead of the ANTrainable class, you would have a function like def train(config, checkpoint_dir, data_object_id).

Then, when you call xgb.train, you can pass in the TuneReportCheckpointCallback, just like in this guide: Tuning XGBoost parameters — Ray v1.4.1.
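To make that concrete, here is a rough sketch of what the refactor could look like, assuming the Ray 1.4 integration module ray.tune.integration.xgboost. The function name train_an is illustrative, and the data prep just mirrors your setup() above:

import numpy as np
import xgboost
from ray.tune.integration.xgboost import TuneReportCheckpointCallback
from sklearn.utils import shuffle

def train_an(config, checkpoint_dir=None, data_obj_id=None):
    # Same train/test split and label construction as in setup() above.
    df = data_obj_id
    len_test = int(0.1 * len(df))

    df_train = shuffle(df.head(len(df) - len_test))
    y_train = np.array([1 if v > 0 else 0 for v in df_train['counts'].values])
    X_train = df_train.drop(['datetime', 'counts'], axis=1).values

    df_test = df.tail(len_test)
    y_test = np.array([1 if v > 0 else 0 for v in df_test['counts'].values])
    X_test = df_test.drop(['datetime', 'counts'], axis=1).values

    # The callback reports the eval metrics to Tune and writes a
    # checkpoint, so no manual save_checkpoint/load_checkpoint is needed.
    xgboost.train(
        config,
        xgboost.DMatrix(X_train, label=y_train),
        evals=[(xgboost.DMatrix(X_test, label=y_test), "eval")],
        verbose_eval=False,
        callbacks=[TuneReportCheckpointCallback(filename="model.xgb")],
    )

Your tune.run call should then work largely unchanged with tune.with_parameters(train_an, data_obj_id=...), except that the callback reports metrics under the eval set’s name, so you’d use something like metric="eval-logloss".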

Let me know if this works for you or if you have any other questions!

Hiya, @amogkam

Thanks for the response! I was able to get this to work with the TuneReportCheckpointCallback, as you suggested above. The only issue I’m facing now is that the setup of each trial takes a very long time, so I am looking to use the reuse_actors=True option to avoid that overhead.

When I tried simply turning that option on while using the function API, it seemed to work as long as there were enough resources for each trial, though it would fail every once in a while because reset_config is not implemented. I was under the impression that reuse_actors would not work unless reset_config was implemented, as in the class API. Is having enough resources for each trial and running on only one node the reason this works?

I was able to get a class API version to work, but the jobs ran significantly slower with the same settings on the same machine.
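For reference, the reset_config in my class version is just the minimal one from the code above; as I understand it, returning True is what signals to Tune that the actor was reset and can be reused for the next trial:

    def reset_config(self, new_config):
        # Swap in the new trial's hyperparameters; the DMatrices built in
        # setup() don't depend on the config, so nothing else needs resetting.
        self.config = new_config
        return True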

Best,

Nikhil