Model training is slower in Ray Tune

How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi everyone, I’m trying to use Ray Tune and have noticed that the same model training code is abysmally slow compared to running it without Ray Tune. My XGBoost training takes <0.3s / iteration when running normally and 15s / iteration (!) when using Ray Tune.

Any help would be really appreciated! Here’s a reproducible example:

from time import time

import ray
import numpy as np
import pandas as pd
import xgboost as xgb
from ray.tune.search import ConcurrencyLimiter
from ray.tune.search.hyperopt import HyperOptSearch


def train_xgb(config, data=None, base_params=None):
    # validation_0 refers to the first tuple of data passed to `eval_set`, in this case
    # `validation`
    tic = time()
    model = (
        xgb.XGBClassifier(**(base_params | config))
        .fit(
            data['train']['features'], data['train']['labels'],
            eval_set=[(data['validation']['features'], data['validation']['labels'])],
            verbose=True,
        )
    )
    print(time() - tic)

    results = model.evals_result()

    return {f'validation-{metric}': values[-1] for metric, values in results['validation_0'].items()}


def train_model(data, tune_hyperparameters):
    # Data
    data = {
        'train': {'features': np.random.rand(4000000, 30), 'labels': np.random.randint(0, 2, 4000000)},
        'validation': {'features': np.random.rand(400000, 30), 'labels': np.random.randint(0, 2, 400000)},
    }

    base_params = {
        'objective': 'binary:logistic',
        'eval_metric': ['logloss', 'auc'],
        'random_state': 0,
        'n_jobs': -1,
        "validate_parameters": True,
        "verbosity": 1,
    }

    if tune_hyperparameters:
        best_hparams = tune_model(
            data,
            base_params,
        )

    else:
        best_hparams = {
            'n_estimators': 10,
            'learning_rate': 0.013921520897736126,
            'grow_policy': 'depthwise',
            'tree_method': 'hist',
            'max_depth': 13,
            'scale_pos_weight': 4.997334464166717,
        }

        best_model = train_xgb(config=best_hparams, data=data, base_params=base_params)


def tune_model(data, base_params):
    run_config = ray.air.RunConfig(
        verbose=0,
    )

    # Dummy
    param_space = {
        'n_estimators': ray.tune.randint(100, 400),
        # 'learning_rate': ray.tune.loguniform(0.001, 0.1),
        # 'grow_policy': ray.tune.choice(['depthwise', 'lossguide']),
        # 'tree_method': ray.tune.choice(['approx', 'hist']),
        # 'max_depth': ray.tune.randint(2, 15),
        # 'scale_pos_weight': 1 / data['train']['labels'].mean() - 1,
    }

    tune_config = ray.tune.TuneConfig(
        search_alg=ConcurrencyLimiter(
            searcher=HyperOptSearch(
                random_state_seed=0,
                points_to_evaluate=[{
                    'n_estimators': 10,
                    'learning_rate': 0.013921520897736126,
                    'grow_policy': 'depthwise',
                    'tree_method': 'hist',
                    'max_depth': 13,
                    'scale_pos_weight': 4.997334464166717,
                }],
            ),
            max_concurrent=1,
        ),
        num_samples=1,
        metric='validation-auc',
        mode='max',
        reuse_actors=True,
    )

    # Note the difference in how parameters are passed into the trainable
    # (train_xgb) when training directly vs. when tuning

    tuner = ray.tune.Tuner(
        ray.tune.with_parameters(train_xgb, data=data, base_params=base_params),
        param_space=param_space,
        run_config=run_config,
        tune_config=tune_config,
    )

    results = tuner.fit()

    best_hparams = results.get_best_result().config

    return best_hparams

# XGBoost - 0.5s / iteration
train_model(None, tune_hyperparameters=False)

# Ray Tune - 15s / iteration
train_model(None, tune_hyperparameters=True)

Hi @bobballand,

You may need to set the resources per trial to the number of CPUs on your machine in this case. XGBoost uses the OMP_NUM_THREADS environment variable (see "Limiting the number of threads used by XGBoost" on Stack Overflow) to determine its parallelism, and Ray sets that variable when it creates a task with a certain number of CPUs assigned to it. Ray Tune defaults to 1 CPU per trial.

See this guide on how to set the resources per trial (only 1 trial in your case): A Guide To Parallelism and Resources for Ray Tune — Ray 2.5.0
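
For example, something along these lines (a rough sketch assuming the Ray 2.x tune.with_resources API; 8 CPUs per trial is just an example, match it to your machine):

trainable = ray.tune.with_resources(
    ray.tune.with_parameters(train_xgb, data=data, base_params=base_params),
    resources={'cpu': 8},  # example: 8 CPUs per trial instead of the default 1
)

tuner = ray.tune.Tuner(
    trainable,
    param_space=param_space,
    run_config=run_config,
    tune_config=tune_config,
)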

@bobballand Did you get all your answers from the docs and links provided by @justinvyu?

That did not do anything.

I’ve also tried setting OMP_NUM_THREADS=32 / OMP_NUM_THREADS=1000, but only managed to get the time per iteration down to ~3s instead of 0.3s.
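
One way to set it for the Ray worker processes is through the runtime environment, roughly like this (a minimal sketch; the value is just an example):

import ray

# Assumption: propagating OMP_NUM_THREADS to the worker processes via
# runtime_env env_vars; adjust the value to your CPU count.
ray.init(runtime_env={'env_vars': {'OMP_NUM_THREADS': '32'}})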

Here’s an experiment comparing No Ray vs. Ray Tune vs. just running as a Ray Core task.

No ray:
[0]     validation_0-logloss:0.69312    validation_0-auc:0.49626
[1]     validation_0-logloss:0.69316    validation_0-auc:0.49657
[2]     validation_0-logloss:0.69331    validation_0-auc:0.49500
[3]     validation_0-logloss:0.69352    validation_0-auc:0.49461
[4]     validation_0-logloss:0.69380    validation_0-auc:0.49595
[5]     validation_0-logloss:0.69416    validation_0-auc:0.49558
[6]     validation_0-logloss:0.69458    validation_0-auc:0.49587
[7]     validation_0-logloss:0.69506    validation_0-auc:0.49537
[8]     validation_0-logloss:0.69560    validation_0-auc:0.49654
[9]     validation_0-logloss:0.69619    validation_0-auc:0.49668
2.795750141143799

-----------
Ray Tune:
(train_xgb pid=27587) [0]       validation_0-logloss:0.69312    validation_0-auc:0.50101
(train_xgb pid=27587) [1]       validation_0-logloss:0.69318    validation_0-auc:0.50037
(train_xgb pid=27587) [2]       validation_0-logloss:0.69328    validation_0-auc:0.49800
(train_xgb pid=27587) [3]       validation_0-logloss:0.69333    validation_0-auc:0.49914
(train_xgb pid=27587) [4]       validation_0-logloss:0.69335    validation_0-auc:0.49710
(train_xgb pid=27587) [5]       validation_0-logloss:0.69334    validation_0-auc:0.49788
(train_xgb pid=27587) [6]       validation_0-logloss:0.69344    validation_0-auc:0.49723
(train_xgb pid=27587) [7]       validation_0-logloss:0.69352    validation_0-auc:0.49651
(train_xgb pid=27587) [8]       validation_0-logloss:0.69360    validation_0-auc:0.49506
(train_xgb pid=27587) [9]       validation_0-logloss:0.69368    validation_0-auc:0.49374
(train_xgb pid=27587) 9.086719751358032

-----------
Ray Core task:
(train_model_ray pid=35238) [0] validation_0-logloss:0.69315    validation_0-auc:0.49470
(train_model_ray pid=35238) [1] validation_0-logloss:0.69326    validation_0-auc:0.49665
(train_model_ray pid=35238) [2] validation_0-logloss:0.69343    validation_0-auc:0.49713
(train_model_ray pid=35238) [3] validation_0-logloss:0.69366    validation_0-auc:0.49719
(train_model_ray pid=35238) [4] validation_0-logloss:0.69397    validation_0-auc:0.49850
(train_model_ray pid=35238) [5] validation_0-logloss:0.69434    validation_0-auc:0.49868
(train_model_ray pid=35238) [6] validation_0-logloss:0.69478    validation_0-auc:0.50013
(train_model_ray pid=35238) [7] validation_0-logloss:0.69529    validation_0-auc:0.50062
(train_model_ray pid=35238) [8] validation_0-logloss:0.69584    validation_0-auc:0.50235
(train_model_ray pid=35238) [9] validation_0-logloss:0.69646    validation_0-auc:0.50358
(train_model_ray pid=35238) 2.599735736846924

I ran this a few times to make sure the results are consistent.

So, no Ray ~= just using Ray Core, but adding Ray Tune is the cause of the slowdown. I have a few hypotheses, but will need to investigate a bit more.

cc: @kai Looks like this is the same thread as here: Slack

Misc

Code for running as a Ray task:

@ray.remote(num_cpus=8)
def train_model_ray():
    # Data
    data = {
        "train": {
            "features": np.random.rand(400000, 30),
            "labels": np.random.randint(0, 2, 400000),
        },
        "validation": {
            "features": np.random.rand(40000, 30),
            "labels": np.random.randint(0, 2, 40000),
        },
    }

    base_params = {
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "auc"],
        "random_state": 0,
        "n_jobs": -1,
        "validate_parameters": True,
        "verbosity": 1,
    }

    best_hparams = {
        "n_estimators": 10,
        "learning_rate": 0.013921520897736126,
        "grow_policy": "depthwise",
        "tree_method": "hist",
        "max_depth": 13,
        "scale_pos_weight": 4.997334464166717,
    }

    best_model = train_xgb(config=best_hparams, data=data, base_params=base_params)


ray.get(train_model_ray.remote())

EDIT: Removed this answer as it was faulty. See below for correct answer.

I’ve looked further into this, and it looks like the config is not passed correctly. This is because HyperOptSearch only outputs parameters to evaluate that are part of its search space.

I.e. if you change this:

    param_space = {
        'n_estimators': ray.tune.randint(100, 400),
        # 'learning_rate': ray.tune.loguniform(0.001, 0.1),
        # 'grow_policy': ray.tune.choice(['depthwise', 'lossguide']),
        # 'tree_method': ray.tune.choice(['approx', 'hist']),
        # 'max_depth': ray.tune.randint(2, 15),
        # 'scale_pos_weight': 1 / data['train']['labels'].mean() - 1,
    }

into

    param_space = {
        'n_estimators': ray.tune.randint(100, 400),
        'learning_rate': ray.tune.loguniform(0.001, 0.1),
        'grow_policy': ray.tune.choice(['depthwise', 'lossguide']),
        'tree_method': ray.tune.choice(['approx', 'hist']),
        'max_depth': ray.tune.randint(2, 15),
        'scale_pos_weight': 1 / data['train']['labels'].mean() - 1,
    }

and set the OMP_NUM_THREADS variable, you’ll end up with the same speed.

The technical reason is that the two runs trained with different parameters, and those parameters have a large impact on training speed.

The pure XGBoost run got the following params:

{'objective': 'binary:logistic', 'eval_metric': ['logloss', 'auc'], 'random_state': 0, 'n_jobs': -1, 'validate_parameters': True, 'verbosity': 1, 'n_estimators': 10, 'learning_rate': 0.013921520897736126, 'grow_policy': 'depthwise', 'tree_method': 'hist', 'max_depth': 13, 'scale_pos_weight': 4.997334464166717}

but without the full search space, the Ray Tune run got only these params:

{'objective': 'binary:logistic', 'eval_metric': ['logloss', 'auc'], 'random_state': 0, 'n_jobs': -1, 'validate_parameters': True, 'verbosity': 1, 'n_estimators': 10}
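
A quick way to confirm what a trial actually receives is to print the merged dict at the top of train_xgb (a small sketch; the rest of the function stays the same):

def train_xgb(config, data=None, base_params=None):
    merged = base_params | config
    # Keys missing here (compared to the pure XGBoost run) were dropped by the
    # searcher because they were not part of its search space.
    print('merged params:', merged)
    ...  # rest of the function unchanged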

Thanks @kai for jumping in and looking into it.

Thanks @kai. I thought that by setting points_to_evaluate in HyperOptSearch, that point would be evaluated first, regardless of which hyperparameters are included in the search space.