How to set the metric?

I’m trying to set up Ray Tune in a Kaggle kernel. It kind of works, but I’m not sure I’m doing it right; I’m getting a bunch of warnings. Could you please clarify a few questions?

  1. There are several places where I can set a metric: inside xgb.train, config, search_alg, scheduler, or tune.run. Where should I set the metric if I’d like to use logloss here, and where should I set mode="min" for it? The documentation says the metric from tune.run is passed to the search algorithm and the scheduler, but what about xgb.train? The warning says it uses the default.

  2. How do I force tune.run to use cross-validation when computing the metric?

  3. How can I fix the warning Parameters: { "n_estimators" } might not be used?

  4. If I’d like to use all 4 available CPU cores, is it enough to set max_concurrent=4 and resources_per_trial={"cpu": 1}, or do I need to add anything else?

  5. Does placing pd.read_csv inside the load_and_train function mean that every trial will read the data again?

%%time
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from ray import tune
from ray.tune.integration.xgboost import TuneReportCallback
from ray.tune.schedulers import ASHAScheduler
from ray.tune.suggest.hebo import HEBOSearch

def load_and_train(config: dict):
    data = pd.read_csv('/kaggle/input/bioresponse/train.csv')
    labels = data['Activity']
    data = data.drop(['Activity'], axis=1)
    
    train_x, test_x, train_y, test_y = train_test_split(
        data, labels,
        test_size=0.2,
        random_state=seed,
        stratify=labels
    )
    
    train_set = xgb.DMatrix(train_x, label=train_y)
    test_set = xgb.DMatrix(test_x, label=test_y)
    
    xgb.train(
        config,
        train_set,
        evals=[(test_set, 'eval')],
        verbose_eval=False,
        callbacks=[TuneReportCallback()]  # reports each round's eval results back to Tune
    )

seed = 0
search_space = {
    "n_estimators": tune.randint(100, 1101),
    "objective": "binary:logistic",
    "max_depth": tune.randint(1, 9),
    "min_child_weight": tune.choice([1, 2, 3]),
    "subsample": tune.uniform(0.5, 1.0),
    "eta": tune.loguniform(1e-4, 1e-1)
}

analysis = tune.run(
    load_and_train,
    num_samples=10,
    metric="eval-logloss",
    mode="min",
    config=search_space,
    search_alg=HEBOSearch(
        random_state_seed=seed,
        max_concurrent=4
    ),
    scheduler=ASHAScheduler(
        max_t=10,
        grace_period=1,
        reduction_factor=2
    ),
    resources_per_trial={"cpu": 1},
    local_dir='xgboost',
    verbose=2
)
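
A minimal sketch of where the metric could live, assuming the warnings come from xgboost’s native API (the eval_metric key and the num_boost_round rename are my reading of the warnings, not something confirmed in this thread):

search_space = {
    "objective": "binary:logistic",
    # xgb.train only computes the metrics listed here; without this key it
    # falls back to the objective's default metric, which is what the
    # warning is about.
    "eval_metric": ["logloss"],
    "max_depth": tune.randint(1, 9),
    "min_child_weight": tune.choice([1, 2, 3]),
    "subsample": tune.uniform(0.5, 1.0),
    "eta": tune.loguniform(1e-4, 1e-1),
}
# metric="eval-logloss" and mode="min" in tune.run are forwarded to the
# search algorithm and scheduler, and they now match the "eval-logloss" key
# that TuneReportCallback reports from the (test_set, 'eval') pair.
# As for the n_estimators warning: xgb.train does not read n_estimators
# from the params dict (that name belongs to the scikit-learn wrapper);
# the native equivalent is the num_boost_round argument of xgb.train itself.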

Is there anybody alive?

Hey @asin,

Maybe you’d have an easier time using Tune-sklearn (a scikit-learn API for Ray Tune)? See GitHub - ray-project/tune-sklearn: a drop-in replacement for scikit-learn’s GridSearchCV / RandomizedSearchCV with cutting-edge hyperparameter tuning techniques.

here’s an example (see the sketch after this list):

  1. It will simplify the metric definition.
  2. It has cross-validation implemented.
  3. Not sure about the n_estimators parameter warning.
  4. Data in tune-sklearn is automatically put into the shared-memory object store, so you only need to read it once.
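
A minimal sketch of what that could look like (the exact TuneSearchCV arguments, n_trials, scoring, cv, search_optimization, and the tuple-range format are my recollection of the tune-sklearn API, so treat this as an assumption rather than verified code):

import pandas as pd
from xgboost import XGBClassifier
from tune_sklearn import TuneSearchCV

# read the data once, outside the trials
data = pd.read_csv('/kaggle/input/bioresponse/train.csv')
labels = data['Activity']
data = data.drop(['Activity'], axis=1)

search = TuneSearchCV(
    XGBClassifier(objective="binary:logistic"),
    param_distributions={
        "n_estimators": (100, 1100),   # n_estimators is meaningful here,
        "max_depth": (1, 8),           # since this is the sklearn wrapper
        "subsample": (0.5, 1.0),
        "learning_rate": (1e-4, 1e-1),
    },
    n_trials=10,
    scoring="neg_log_loss",  # sklearn scorer; the "neg" makes larger better
    cv=5,                    # cross-validation folds, addressing question 2
    search_optimization="bayesian",
    random_state=0,
)
search.fit(data, labels)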

Thank you for your answer.
As far as I know, Tune-sklearn does not support HEBOSearch; that’s why I use tune.run like this. I’ve read all the documentation but didn’t find the cross-validation settings (shuffle, random_state, stratify, n_splits). Where can I find them?

It should support HEBOSearch now; see tune-sklearn/custom_searcher_example.py at master in ray-project/tune-sklearn on GitHub.
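
Building on that link, a sketch of combining a custom searcher with explicit cross-validation settings. It assumes TuneSearchCV accepts a Searcher instance for search_optimization (as in the linked example), an sklearn splitter for the cv argument, and Tune distributions in param_distributions; all three are assumptions about the API rather than something stated in this thread:

from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier
from tune_sklearn import TuneSearchCV
from ray import tune
from ray.tune.suggest.hebo import HEBOSearch

seed = 0

# shuffle, random_state, stratification, and n_splits all live on the splitter
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)

search = TuneSearchCV(
    XGBClassifier(objective="binary:logistic"),
    param_distributions={
        "n_estimators": tune.randint(100, 1101),
        "max_depth": tune.randint(1, 9),
    },
    search_optimization=HEBOSearch(random_state_seed=seed),
    cv=cv,
    scoring="neg_log_loss",
    n_trials=10,
)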