Medium: It contributes significant difficulty to completing my task, but I can work around it.
I am using TuneGridSearchCV to tune a fairly large XGBoost model on my university’s HPC cluster. I keep getting warnings saying “Processing trial results took X s, which may be a performance bottleneck. Please consider reporting results less frequently to Ray Tune.” How can I report less frequently with TuneGridSearchCV? The results go to ray_results in my home directory, which is very slow to write to. Our work directories are much faster, so it might also help if I could move ray_results somewhere else.
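On the ray_results location: the closest things I could find in the docs are a TUNE_RESULT_DIR environment variable in Ray Tune and a local_dir argument on TuneGridSearchCV. The sketch below is just my guess at how they would be used (the /work path is made up for illustration), and I haven’t verified that either one actually works through tune-sklearn:

import os

# My guess from the Ray Tune docs: redirect results to our fast work filesystem.
# The path below is made up for illustration.
os.environ["TUNE_RESULT_DIR"] = "/work/myuser/ray_results"

# tune-sklearn also appears to accept a local_dir argument directly:
# grid_cv = TuneGridSearchCV(..., local_dir="/work/myuser/ray_results")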
I tried setting early_stopping=True, which speeds up the process and removes the bottleneck, but then I get two warnings: “UserWarning: early_stopping is enabled but max_iters = 1. To enable partial training, set max_iters > 1.” and “UserWarning: tune-sklearn implements incremental learning for xgboost models following this: https://github.com/dmlc/xgboost/issues/1686. This may negatively impact performance. To disable, set early_stopping=False.”
I’ve been testing on smaller parameter spaces, but I’m afraid everything will slow down once I expand to the full parameter space (although I don’t understand the max_iters warning, since that shouldn’t apply to TuneGridSearchCV).
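In case it matters, my reading of the first warning is that max_iters is an argument of TuneGridSearchCV itself, not of the estimator, so enabling partial training would presumably look like the sketch below (xgb_model and params are from my script further down; the value 10 and its effect are assumptions on my part, not something I’ve confirmed):

# My guess at what the warning is asking for; untested.
grid_cv = TuneGridSearchCV(
    xgb_model,
    param_grid=params,
    early_stopping=True,
    max_iters=10,  # assumption: split each trial's training into incremental steps so it can stop early
    cv=5,
    n_jobs=-1,
    scoring='roc_auc',
    use_gpu=True,
)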
Essentially, the reporting is slowing my runs down and making them fail by filling up the storage of the directory that ray_results lives in, so I have three questions (and please understand I am quite new to all this; I tried looking in the docs but I don’t exactly understand what to do):
- How can I report less frequently?
- How can I change the ray_results folder from ~/ray_results to something else?
- Should I implement early stopping despite the warnings?
I would really appreciate answers to any or all of the three! Thank you!
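For question 1, the only knobs I’ve turned up so far are Ray Tune environment variables; I don’t know whether tune-sklearn respects them, so the values below are just guesses I was planning to try:

import os

# Both guesses from the Ray Tune environment-variable docs; untested with tune-sklearn.
os.environ["TUNE_RESULT_BUFFER_LENGTH"] = "100"  # buffer this many results before processing them
os.environ["TUNE_GLOBAL_CHECKPOINT_S"] = "600"   # checkpoint experiment state at most every 10 minutes

Anyway, here is my current script: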
import numpy as np
import pandas as pd
from pandas import MultiIndex, Int16Dtype  # reportedly a workaround for an xgboost/pandas import bug
from sklearnex import patch_sklearn  # these lines MUST come before importing any sklearn packages
patch_sklearn()
import xgboost as xgb
from tune_sklearn import TuneGridSearchCV
from datetime import datetime
import sys
if __name__ == "__main__":
    df_train = pd.read_excel('my_dataset.xlsx', sheet_name='Train')
    train_cols = df_train.columns[df_train.columns != 'Response']
    X_train = pd.DataFrame(df_train, columns=train_cols)
    y_train = pd.DataFrame(df_train, columns=['Response'])

    params = {
        "n_estimators": list(range(100, 800, 100)),
        "max_depth": list(range(2, 12, 2)),         # range: [0, ∞] from document
        "min_child_weight": list(range(2, 12, 2)),  # range: [0, ∞] from document
        # "gamma": np.arange(0, 1.05, 0.1),               # range: [0, ∞] from document
        # "colsample_bytree": np.arange(0.5, 1.05, 0.1),  # range: [0, 1] from document
        # "colsample_bylevel": np.arange(0.5, 1.05, 0.1), # range: [0, 1] from document
        # "reg_lambda": [0.1, 1.0, 5.0, 10.0, 25.0, 50.0],
    }

    xgb_model = xgb.XGBClassifier(seed=0, use_label_encoder=False, tree_method='hist')
    print("Tree method = hist, 74 cores, a100 gpu, 3 gpus per, hist, early stopping")
    print(params)

    grid_cv = TuneGridSearchCV(xgb_model, param_grid=params, early_stopping=True,
                               cv=5, n_jobs=-1, scoring='roc_auc', use_gpu=True)

    current_time = datetime.now().strftime("%H:%M:%S")
    print("Start Time =", current_time)
    print('\n\n')

    grid_cv.fit(X_train, y_train.values.ravel())

    current_time = datetime.now().strftime("%H:%M:%S")
    print('End Time: ', current_time)
    print('\n\n')

    print('Grid best score (roc_auc): ')
    print(grid_cv.best_score_)
    print('\n\n')

    print('Grid best hyperparameters: ')
    print(grid_cv.best_params_)
    print('\n\n')