Ray Tune performance decreases with more CPUs per trial

Hi all! In some code I wrote recently, trials are taking longer to execute as I allocate more cpus_per_trial.

I created the script below to reproduce the behavior I’m seeing. I’m doing a grid search over hyperparameters for Bayesian optimization algorithms (meta, I know!).

As you can see in my notes, the code executes quickly in local_mode (the same as if I ran the script without Tune). However, when I allocate an equivalent number of CPUs per trial in cluster mode (4 on my MacBook Pro), the code takes more than 3x as long to execute!

Any ideas on what might be going on here?

""""
1. Install dependencies: `pip install botorch ray[default] numpy`
2. Run script in local mode: `python reproduce_botorch.py --local_mode`
3. Run script on normal cluster: `python reproduce_botorch.py --cpus_per_trial <CPUS>`

Running on a 2015 MacBook Pro (Intel i5, 16 GB RAM):
- Local mode: 8.75 seconds
- Cluster mode with 1 cpu per trial: 17.06 seconds
- Cluster mode with 2 cpus per trial: 20.22 seconds
- Cluster mode with 3 cpus per trial: 32.47 seconds
- Cluster mode with 4 cpus per trial: 30.85 seconds
"""

import numpy as np
import torch
from ray import tune
import ray
from botorch.models import MultiTaskGP
from botorch.fit import fit_gpytorch_model
from gpytorch.mlls.exact_marginal_log_likelihood import (
    ExactMarginalLogLikelihood,
)
from botorch.acquisition import ExpectedImprovement as EI
from botorch.optim import optimize_acqf
import argparse


def main(local_mode, cpus_per_trial):

    ray.init(local_mode=local_mode)
    tune.run(
        trainable, num_samples=10, resources_per_trial={"cpu": int(cpus_per_trial)}
    )


def trainable(config):
    X, y = generate_data(noise=20, n_t1=50, n_t2=5)

    # Train model
    model = MultiTaskGP(X, y, task_feature=-1, output_tasks=[1])
    mll = ExactMarginalLogLikelihood(model.likelihood, model)
    _ = fit_gpytorch_model(mll)

    # Acquisition function
    ei = EI(model, best_f=y.min(), maximize=False)
    optimize_acqf(
        acq_function=ei,
        bounds=torch.tensor([[X.min(), X.max()]]).T.float(),
        num_restarts=20,
        q=1,
        raw_samples=100,
    )


### Data Generation
f1 = lambda X: torch.cos(5 * X[:, 0]) ** 2  # objective for task 1
f2 = lambda X: 1.5 * torch.cos(5 * X[:, 0]) ** 2  # objective for task 2 (scaled version of task 1)
gen_inputs = lambda n: torch.rand(n, 1)  # n random inputs in [0, 1)
gen_obs = lambda X, f, noise: f(X) + noise * torch.rand(X.shape[0])  # noisy observations of f


def generate_data(noise, n_t1, n_t2):
    # Inputs for each task plus a task-indicator column (0 = task 1, 1 = task 2)
    X1, X2 = gen_inputs(n_t1), gen_inputs(n_t2)
    i1, i2 = torch.zeros(n_t1, 1), torch.ones(n_t2, 1)

    train_X = torch.cat([torch.cat([X1, i1], -1), torch.cat([X2, i2], -1)])

    # Noisy observations for both tasks, standardized to zero mean and unit variance
    train_Y_f1 = gen_obs(X1, f1, noise)
    train_Y_f2 = gen_obs(X2, f2, noise)
    train_Y = torch.cat([train_Y_f1, train_Y_f2]).unsqueeze(-1)
    train_Y_mean = train_Y.mean()
    train_Y_std = train_Y.std()
    train_Y_norm = (train_Y - train_Y_mean) / train_Y_std

    return train_X, train_Y_norm


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_mode", default=False, action="store_true")
    parser.add_argument("--cpus_per_trial", default=4, type=int)
    args = parser.parse_args()
    main(args.local_mode, args.cpus_per_trial)

Hi @marcosfelt,

there are a few things to note here.

First, the number of CPUs will impact how many trials can be run in parallel. If you specify 2 CPUs per trial, you can run 2 trials in parallel (as your laptop has 4 CPUs). If you specify 4 CPUs per trial, only 1 trial can run, so the trials are executed sequentially.
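
For example, the arithmetic on a 4-CPU machine looks roughly like this (just a sketch to illustrate the scheduling limit):

total_cpus = 4  # CPUs available to Ray on the laptop
for cpus_per_trial in (1, 2, 3, 4):
    # Tune can only run as many trials concurrently as the CPU budget allows
    print(f"{cpus_per_trial} CPU(s) per trial -> {total_cpus // cpus_per_trial} concurrent trial(s)")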

Second, I’m not sure whether botorch automatically uses multithreading or multiprocessing to speed up computations. You might have to tell it explicitly to use more resources in order to speed up each individual trial; otherwise it might effectively only use 1 CPU. In that case, the number of CPUs you specify really just limits the concurrency, which is probably not what you want.
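
One quick way to check what a trial actually sees is to print torch's thread settings from inside the trainable. A minimal sketch (the prints are only for inspection):

def trainable(config):
    import torch
    # Number of threads torch uses for intra-op parallelism in this worker
    print("torch intra-op threads:", torch.get_num_threads())
    # Details about the OpenMP/MKL parallelism backend torch was built with
    print(torch.__config__.parallel_info())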

Third, these are very fast-executing trials (1–2 seconds per trial), so the overhead you observe is mostly scheduling overhead, which is roughly constant per trial. You can reduce it a bit by passing reuse_actors=True to tune.run(), which reuses the remote actors between trials and thus incurs less scheduling overhead.
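
For example (a sketch, keeping your other arguments unchanged):

tune.run(
    trainable,
    num_samples=10,
    resources_per_trial={"cpu": int(cpus_per_trial)},
    reuse_actors=True,  # reuse trial actors instead of starting a new one per trial
)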

I hope that helps!

Hi @kai! Thanks for the quick reply!

I checked this on our Linux server, and botorch does seem to use some sort of multithreading when run normally outside of Tune (I could see multiple cores being used in htop).

What is strange is that the multithreading seems to work in local mode but not when I allocate the equivalent number of CPUs to the trial (both on my MacBook and on a 24-core Linux server). Below is a screenshot of the Ray dashboard. You can see that only 1 CPU is being used despite 4 being allocated to the trial.

The gap between local mode and cluster mode becomes very apparent when I start running multiple iterations of optimization.

Am I missing something about how CPUs get allocated to trials?

Guess I just needed to think about this some more. As @kai suggested, PyTorch automatically sets the number of threads to the number of cores on the machine, which I’m guessing somehow ends up as only one when running via Tune.

I did the following to fix the issue:

  1. Add the number of CPUs being used to the trial config:
    tune.run(
        trainable,
        num_samples=10,
        config={"num_threads": cpus_per_trial},
        resources_per_trial={"cpu": int(cpus_per_trial)},
    )
  2. Then, in the trainable function, set the number of threads:
    def trainable(config):
        import torch  # have to import torch here, inside the trainable
        torch.set_num_threads(config.pop("num_threads"))
        # ... rest of the trainable as before
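
For anyone hitting the same issue, a quick way to double-check that the setting actually takes effect inside the Ray worker is to report the thread count back through Tune (a sketch; the reported metric name is arbitrary):

from ray import tune

def trainable(config):
    import torch  # imported inside so this runs in the Ray worker process
    torch.set_num_threads(config.pop("num_threads"))
    # Surface the effective thread count in the Tune results table
    tune.report(num_threads=torch.get_num_threads())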