I’m trying to set up `ray.tune` in a Kaggle kernel. It kind of works, but I’m not sure I’m doing it right, and I’m getting a bunch of warnings. Could you please clarify a few questions for me?
- There are several places where I can set a metric: inside `xgb.train`, `config`, `search_alg`, `scheduler`, or `tune.run`. Where should I set the metric if I’d like to use `logloss` in this case? And where should I set `mode="min"` for it? The documentation says that the metric from `tune.run` is passed on to the search algorithm and the scheduler, but what about `xgb.train`? The warning says it uses the default. (See the first sketch after the code below for what I have in mind.)
- How do I force `tune.run` to use cross-validation when calculating the metric? (Second sketch below.)
- How can I fix this warning: `Parameters: { "n_estimators" } might not be used.`? (Third sketch below.)
- If I’d like to use all 4 available CPU cores, is it enough to set `max_concurrent=4` and `resources_per_trial={"cpu": 1}`, or do I need to add anything else?
- Does placing `pd.read_csv` inside the `load_and_train` function mean that every trial will read the data again? (Fourth sketch below.)

Here is my code:
```python
%%time
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

from ray import tune
from ray.tune.schedulers import ASHAScheduler
from ray.tune.suggest.hebo import HEBOSearch
from ray.tune.integration.xgboost import TuneReportCallback

seed = 0


def load_and_train(config: dict):
    # Tune calls this once per trial, so the CSV is read on every call.
    data = pd.read_csv('/kaggle/input/bioresponse/train.csv')
    labels = data['Activity']
    data = data.drop(['Activity'], axis=1)
    train_x, test_x, train_y, test_y = train_test_split(
        data, labels,
        test_size=0.2,
        random_state=seed,
        stratify=labels
    )
    train_set = xgb.DMatrix(train_x, label=train_y)
    test_set = xgb.DMatrix(test_x, label=test_y)
    xgb.train(
        config,
        train_set,
        evals=[(test_set, 'eval')],
        verbose_eval=False,
        # Reports each round's evaluation results back to Tune.
        callbacks=[TuneReportCallback()]
    )


search_space = {
    "n_estimators": tune.randint(100, 1101),  # triggers the "might not be used" warning
    "objective": "binary:logistic",
    "max_depth": tune.randint(1, 9),
    "min_child_weight": tune.choice([1, 2, 3]),
    "subsample": tune.uniform(0.5, 1.0),
    "eta": tune.loguniform(1e-4, 1e-1)
}

analysis = tune.run(
    load_and_train,
    num_samples=10,
    metric="eval-logloss",
    mode="min",
    config=search_space,
    search_alg=HEBOSearch(
        random_state_seed=seed,
        max_concurrent=4
    ),
    scheduler=ASHAScheduler(
        max_t=10,
        grace_period=1,
        reduction_factor=2
    ),
    resources_per_trial={"cpu": 1},
    local_dir='xgboost',
    verbose=2
)
```
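To make the first question concrete, here is a sketch of what I have in mind: if `eval_metric` is just another XGBoost training parameter, I could put it into the search space, and `TuneReportCallback` would then report it under the name `eval-logloss` (from the eval set named `'eval'`), matching the `metric` I already pass to `tune.run`. I’m not sure this is the intended way:

```python
# My assumption for question 1 (not verified): set the metric as a regular
# XGBoost training parameter so xgb.train stops falling back to the default.
search_space = {
    "objective": "binary:logistic",
    "eval_metric": "logloss",  # TuneReportCallback should report it as "eval-logloss"
    "max_depth": tune.randint(1, 9),
    "eta": tune.loguniform(1e-4, 1e-1)
}
# mode="min" would then stay in tune.run only, since tune.run passes metric
# and mode on to HEBOSearch and ASHAScheduler.
```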
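For the cross-validation question, the only way I can see is to run the CV myself inside the trainable with `xgb.cv` and report the averaged score to Tune manually; I don’t know whether Tune has a built-in option for this. A sketch under that assumption (the `train_with_cv` name is mine):

```python
def train_with_cv(config: dict):
    # 5-fold CV inside the trainable; report the mean validation logloss of
    # the final boosting round to Tune. With this, tune.run would need
    # metric="logloss" instead of "eval-logloss".
    data = pd.read_csv('/kaggle/input/bioresponse/train.csv')
    labels = data['Activity']
    dtrain = xgb.DMatrix(data.drop(['Activity'], axis=1), label=labels)
    cv_results = xgb.cv(
        config,
        dtrain,
        nfold=5,
        metrics='logloss',
        seed=seed
    )
    tune.report(logloss=cv_results['test-logloss-mean'].iloc[-1])
```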
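For the `n_estimators` warning, my guess is that `xgb.train` simply doesn’t know this parameter (it belongs to the sklearn wrapper `XGBClassifier`) and instead takes the number of rounds as the `num_boost_round` argument, so popping it out of the config before training should silence the warning. A sketch of that guess:

```python
def load_and_train(config: dict):
    config = dict(config)  # copy, so the trial's config isn't mutated
    # n_estimators is an sklearn-wrapper parameter; xgb.train takes the
    # number of rounds as an argument instead, hence the warning.
    num_boost_round = config.pop('n_estimators')
    data = pd.read_csv('/kaggle/input/bioresponse/train.csv')
    labels = data['Activity']
    data = data.drop(['Activity'], axis=1)
    train_x, test_x, train_y, test_y = train_test_split(
        data, labels, test_size=0.2, random_state=seed, stratify=labels
    )
    train_set = xgb.DMatrix(train_x, label=train_y)
    test_set = xgb.DMatrix(test_x, label=test_y)
    xgb.train(
        config,
        train_set,
        num_boost_round=num_boost_round,
        evals=[(test_set, 'eval')],
        verbose_eval=False,
        callbacks=[TuneReportCallback()]
    )
```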
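On the last question, my understanding is that Tune invokes the trainable once per trial, so the CSV really would be re-read every time. If that’s right, `tune.with_parameters` looks like the way to load the data once and hand it to all trials; a sketch assuming it works the way I think (the `train_on` name is mine):

```python
# Load once, outside the trainable.
data = pd.read_csv('/kaggle/input/bioresponse/train.csv')
labels = data['Activity']
data = data.drop(['Activity'], axis=1)

def train_on(config: dict, data=None, labels=None):
    # Same body as load_and_train, minus the pd.read_csv call.
    train_x, test_x, train_y, test_y = train_test_split(
        data, labels, test_size=0.2, random_state=seed, stratify=labels
    )
    train_set = xgb.DMatrix(train_x, label=train_y)
    test_set = xgb.DMatrix(test_x, label=test_y)
    xgb.train(config, train_set, evals=[(test_set, 'eval')],
              verbose_eval=False, callbacks=[TuneReportCallback()])

analysis = tune.run(
    tune.with_parameters(train_on, data=data, labels=labels),
    # ...rest of the tune.run arguments unchanged...
)
```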