Ray Tune: part of trainable 10 times slower in a Ray Tune trial than without Ray

Hi,

Inside my trainable I am fitting the SMOTE algorithm from imbalanced-learn, which is based on scikit-learn's NearestNeighbors implementation. My parameter space contains only one config, so there is only one trial, and I set the trial resources to those of my machine. The part of the trainable that fits the SMOTE algorithm is 10 times slower in the Ray trial than without using Ray. I am not talking about the full execution time of the script (which can suffer from overhead) but only about the part where I fit this algorithm. I also set OMP_NUM_THREADS to the total number of CPUs available on my machine. I am thus trying to understand what could explain this.

I will try to reproduce it with a much smaller piece of code than the one I am currently using.

Thanks!


Hi @albertcthomas,

Ray (Tune) does not do anything special except for setting the OMP_NUM_THREADS variable if it’s unset. Where do you set it? Can you confirm in your training function that it is set by printing it out?

Please note that it may not be enough to just set it in the training function, e.g.

import os

def train(config):
    os.environ["OMP_NUM_THREADS"] = "8"

may not work as many libraries (e.g. pandas) set their internal state based on this variable on import. Thus you may have to set this in the environment before starting Python.
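If it helps, a quick way to confirm inside the training function what the native thread pools actually picked up (beyond printing the environment variable) is threadpoolctl, which ships as a scikit-learn dependency. This is just a diagnostic sketch:

from pprint import pprint
from threadpoolctl import threadpool_info

# Each entry describes one loaded OpenMP/BLAS library and reports the
# 'num_threads' it will actually use, which is what matters for SMOTE's
# nearest-neighbor search.
pprint(threadpool_info())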

Hi @kai,

Thanks a lot for the quick reply. I created a minimal reproducible example, but it turns out the problem depends on the machine on which I run the script.

with ray

import os
import time

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

from ray import tune
from ray.air import RunConfig
from ray.tune import CLIReporter


def trainable(parameter_space):

    os.environ["OMP_NUM_THREADS"] = '56'
    print(os.environ["OMP_NUM_THREADS"])

    k_neighbors = parameter_space["k_neighbors"]

    X, y = make_classification(
        n_classes=2,
        weights=[0.4, 0.6],
        n_features=100,
        n_samples=100_000,
        random_state=10)

    sm = SMOTE(k_neighbors=k_neighbors)
    start = time.time()
    _, _ = sm.fit_resample(X, y)
    print('------ Time ------:', (time.time() - start))


trainable = tune.with_resources(trainable, {"cpu": 56, "gpu": 6})
reporter = CLIReporter(max_report_frequency=300)
tuner = tune.Tuner(
    trainable,
    param_space={"k_neighbors": 10},
    run_config=RunConfig(progress_reporter=reporter)
)
results = tuner.fit()

without ray

import os
import time

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

k_neighbors = 10

X, y = make_classification(
    n_classes=2,
    weights=[0.4, 0.6],
    n_features=100,
    n_samples=100_000,
    random_state=10)

sm = SMOTE(k_neighbors=k_neighbors)
start = time.time()
_, _ = sm.fit_resample(X, y)
print('Time:', (time.time() - start))

This is a CentOS server with 56 CPUs and 6 GPUs.
Running the script without Ray returns

Time: 3.4180803298950195

Running the script with Ray returns

2023-01-27 22:15:28,785 INFO worker.py:1538 -- Started a local Ray instance.
== Status ==
Current time: 2023-01-27 22:15:34 (running for 00:00:02.43)
Memory usage on this node: 22.9/503.5 GiB
Using FIFO scheduling algorithm.
Resources requested: 56.0/56 CPUs, 6.0/6 GPUs, 0.0/208.61 GiB heap, 0.0/93.39 GiB objects (0.0/1.0 accelerator_type:P100)
Result logdir: ~/ray_results/trainable_2023-01-27_22-15-26
Number of trials: 1/1 (1 RUNNING)
+-----------------------+----------+--------------------+
| Trial name            | status   | loc                |
|-----------------------+----------+--------------------|
| trainable_0e5d3_00000 | RUNNING  | 10.206.42.11:52335 |
+-----------------------+----------+--------------------+


(trainable pid=52335) 56
Trial trainable_0e5d3_00000 completed. Last result:
== Status ==
Current time: 2023-01-27 22:15:53 (running for 00:00:22.32)
Memory usage on this node: 22.9/503.5 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/56 CPUs, 0/6 GPUs, 0.0/208.61 GiB heap, 0.0/93.39 GiB objects (0.0/1.0 accelerator_type:P100)
Result logdir: ~/ray_results/trainable_2023-01-27_22-15-26
Number of trials: 1/1 (1 TERMINATED)
+-----------------------+------------+--------------------+
| Trial name            | status     | loc                |
|-----------------------+------------+--------------------|
| trainable_0e5d3_00000 | TERMINATED | 10.206.42.11:52335 |
+-----------------------+------------+--------------------+


(trainable pid=52335) ------ Time ------: 18.864986181259155
2023-01-27 22:15:54,064 INFO tune.py:762 -- Total run time: 23.17 seconds (22.31 seconds for the tuning loop).

So it takes approximately 5 times longer with Ray. I tried on an Ubuntu EC2 machine with 8 CPUs, using os.environ["OMP_NUM_THREADS"] = '8' and trainable = tune.with_resources(trainable, {"cpu": 8}), and I got approximately the same time with Ray and without Ray (~2.5 seconds).

I don’t know if you have any clue where the issue could come from on the first, bigger machine. I tried setting gpu to 0 in tune.with_resources but this does not change anything.

Thanks again.

Hi,

I also couldn’t reproduce the problem on my machine - I get the same times with Ray Tune and without it, irrespective of the OMP_NUM_THREADS setting.

Just to clarify my last answer: setting OMP_NUM_THREADS in the trainable will probably not work, as the variable is often read on package import. Instead, you have to set it on the command line before running the script.

E.g.

export OMP_NUM_THREADS=56
python train.py

Can you try this on your CentOS server again?

Yes! Doing export OMP_NUM_THREADS=56 before running python works! I now get the same times with Ray and without! Thanks a lot for your help :)

For the record, on my Ubuntu EC2 machine, if I remove the two lines

    os.environ["OMP_NUM_THREADS"] = '8'
    print(os.environ["OMP_NUM_THREADS"])

it is then much slower (~12 seconds compared to the ~2.5 seconds with these lines).
Still without these lines, but with export OMP_NUM_THREADS=8 set in the shell, I’m back to ~2 seconds.

One last question: now that I will have multiple trials, shouldn’t I set an OMP_NUM_THREADS value for each trial?

Generally, if you run e.g. 8 trials in parallel on your 56-CPU machine, you should use 7 CPUs (and OMP_NUM_THREADS=7) per trial so that the resources are evenly split.

If you use a higher number, the trials will compete for the resources - you likely won’t get more speedup, but in your simple case I don’t think you would run into problems.
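For instance, a minimal sketch of that setup on the 56-CPU machine, reusing the trainable, reporter, and imports from your script above (the k_neighbors values are just placeholders to produce 8 trials):

export OMP_NUM_THREADS=7
python train.py

with train.py containing something like:

trainable = tune.with_resources(trainable, {"cpu": 7})
tuner = tune.Tuner(
    trainable,
    # grid_search over 8 values -> 8 trials, which can all run in parallel
    # since 8 trials x 7 CPUs = 56 CPUs
    param_space={"k_neighbors": tune.grid_search([3, 4, 5, 6, 7, 8, 9, 10])},
    run_config=RunConfig(progress_reporter=reporter),
)
results = tuner.fit()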


OK, thanks again. The number I set with export OMP_NUM_THREADS before calling python is the number that will be read and used by Ray for each trial?

Yes. Technically, the number you set OMP_NUM_THREADS to is the number that will be used for every Ray actor or task, including Ray Tune’s remote trainables.
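A minimal sketch to check this yourself (assuming OMP_NUM_THREADS was exported in the shell before starting Python; report_omp_threads is just an illustrative helper):

import os
import ray

ray.init()

@ray.remote
def report_omp_threads():
    # Every Ray task/actor process sees the value exported in the shell,
    # unless it was unset, in which case Ray sets a default.
    return os.environ.get("OMP_NUM_THREADS")

print(ray.get(report_omp_threads.remote()))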
