Ray Tune v2 performance regression

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

We are in the process of upgrading Ray from 1.13.0 to 2.2.0.

We are running Ray in a Kubernetes cluster and noticed a significant performance drop with Ray Tune (3x to 5x slower). Below is an example that illustrates our problem, including the Ray Tune changes made as part of the new 2.2.0 API.

Test 1, Ray 1.13.0

Configuration:

from functools import partial

import numpy as np
import ray
from ray import tune
from ray.tune.suggest.basic_variant import BasicVariantGenerator


@ray.remote
def run():
    def train(data_ref, config):
        data = ray.get(data_ref)
        # Simulate compute intense task
        a = np.random.rand(3000, 3000)
        b = np.random.rand(3000, 3000)
        result = np.dot(a, b)
        # We return some data from the compute task, nothing heavy
        return {"b": config["a"]}

    # Simulate passing big parameters to the `train` function
    data = np.random.random(size=100000000)
    data_ref = ray.put(data)

    # Run 100 trials
    config = {"a": tune.grid_search(list(range(100)))}

    search = tune.run(
        partial(train, data_ref),
        search_alg=BasicVariantGenerator(max_concurrent=10),
        config=config,
        resources_per_trial=tune.PlacementGroupFactory([{"CPU": 1}], strategy="PACK"),
        num_samples=1,
    )


if __name__ == "__main__":
    ray.init("ray://127.0.0.1:10001")
    ray.get(run.remote())

Most trials take between 1 and 2 seconds to run.

Total run time: 73.10 seconds

Test 2, Ray 2.2.0

Configuration:

  • Python 3.10.9
  • Ray 2.2.0
  • Ray Helm Chart: Kuberay v0.4.0
  • Same K8s cluster as the first test (same nodes, etc.)
  • Head: 4 CPU, 8Gi memory, no processing (--num-cpus=0)
  • 3 workers: 4 CPU, 8Gi memory

import numpy as np
import ray
from ray import tune
from ray.tune.search import BasicVariantGenerator


@ray.remote
def run():
    def train(config, data):
        # Simulate compute intense task
        a = np.random.rand(3000, 3000)
        b = np.random.rand(3000, 3000)
        result = np.dot(a, b)
        # We return some data from the compute task, nothing heavy
        return {"b": config["a"]}

    # Simulate passing big parameters to the `train` function
    data = np.random.random(size=100000000)

    # Run 100 trials
    config = {"a": tune.grid_search(list(range(100)))}

    tuner = tune.Tuner(
        tune.with_resources(
            tune.with_parameters(train, data=data),
            tune.PlacementGroupFactory([{"CPU": 1}], strategy="PACK"),
        ),
        param_space=config,
        tune_config=tune.TuneConfig(
            search_alg=BasicVariantGenerator(max_concurrent=10),
            num_samples=1,
        ),
    )

    res = tuner.fit()


if __name__ == "__main__":
    # Connect to the K8s cluster and run
    ray.init("ray://127.0.0.1:10001")
    ray.get(run.remote())

During the run, several warnings pop up:

The `on_step_end` operation took 0.874 s, which may be a performance bottleneck.
The `callbacks.on_trial_result` operation took 0.538 s, which may be a performance bottleneck.
The `process_trial_result` operation took 0.539 s, which may be a performance bottleneck.

15 trials take between 20s and 60s to run.

Total run time: 291.01 seconds


Thanks for the thorough writeup!

I’ve run this locally and can’t reproduce it on a single node (total run times in seconds):

Script        Ray 1.13   Ray 2.2
1.13 script   53.05      -
2.2 script    56.02      57.59

This indicates that the regression you see is due to multi-node behavior, e.g. the syncing of trial information.

The warnings are interesting - process_trial_result includes callbacks.on_trial_result, so the bottleneck is in the callbacks. Can you try setting the env variable TUNE_DISABLE_AUTO_CALLBACK_SYNCER=1 and running it again (2.2 on the new script)? And if that doesn’t change anything, try also setting TUNE_DISABLE_AUTO_CALLBACK_LOGGERS=1?
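
Since the reproduction drives Tune from a remote `run` task over Ray Client, one way to make sure those variables actually reach the process that runs Tune would be to forward them through the runtime environment (a sketch only, assuming they just need to be visible inside the remote task):

if __name__ == "__main__":
    # Forward the Tune debug env vars to the remote driver task.
    ray.init(
        "ray://127.0.0.1:10001",
        runtime_env={
            "env_vars": {
                "TUNE_DISABLE_AUTO_CALLBACK_SYNCER": "1",
                # Uncomment if the syncer variable alone changes nothing:
                # "TUNE_DISABLE_AUTO_CALLBACK_LOGGERS": "1",
            }
        },
    )
    ray.get(run.remote())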

on_step_end is even more interesting - this is just a resource cleanup. This makes me wonder if there’s a regression in KubeRay compared to the 1.13 helm chart.

I’ll try to get a Kubernetes setup going and look into this. I’m also happy to pair debug on this if you’re available?

Quick update, I can actually reproduce this on a multi-node cluster (without KubeRay), so I’ll look into this more closely today.

It looks like this is indeed a problem in the trial synchronization. If you pass

sync_config=tune.SyncConfig(syncer=None)

either to tune.run() (in the 1.13 script) or to the RunConfig (in the 2.2 script), it speeds up training to the expected levels:

(run pid=61223) 2023-01-31 11:49:07,371 INFO tune.py:762 -- Total run time: 57.61 seconds (57.10 seconds for the tuning loop).

I’ll see if there is something we can do to speed this up even with syncing.
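
For reference, a minimal sketch of where this setting goes in each script (each half drops into the corresponding script above and reuses its train, config, data_ref/data, and search algorithm definitions; treat it as an illustration rather than a verified fix):

# Ray 1.13 script: pass the SyncConfig directly to tune.run()
tune.run(
    partial(train, data_ref),
    search_alg=BasicVariantGenerator(max_concurrent=10),
    config=config,
    resources_per_trial=tune.PlacementGroupFactory([{"CPU": 1}], strategy="PACK"),
    num_samples=1,
    sync_config=tune.SyncConfig(syncer=None),
)

# Ray 2.2 script: pass it via RunConfig on the Tuner (needs `from ray import air`)
tuner = tune.Tuner(
    tune.with_resources(
        tune.with_parameters(train, data=data),
        tune.PlacementGroupFactory([{"CPU": 1}], strategy="PACK"),
    ),
    param_space=config,
    tune_config=tune.TuneConfig(
        search_alg=BasicVariantGenerator(max_concurrent=10),
        num_samples=1,
    ),
    run_config=air.RunConfig(sync_config=tune.SyncConfig(syncer=None)),
)
tuner.fit()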

Thanks for looking at this @kai!

In 1.13.0, the sync_config defaults to None. With 2.2.0, the syncer defaults to "auto".

I tried to set the syncer to None in the 2.2.0 script:

    # (requires `from ray import air` at the top of the script)
    tuner = tune.Tuner(
        ...
        run_config=air.RunConfig(sync_config=tune.SyncConfig(syncer=None)),
    )

But I am still seeing the same behavior (the multiple warnings posted previously, etc.)

Total run time: 280.7 seconds

Hm, interesting. For me, the 2.2 script gets a large speedup with the RunConfig + SyncConfig setting. Last run:

(run pid=75347) 2023-01-31 12:54:51,692 INFO tune.py:762 -- Total run time: 50.64 seconds (50.38 seconds for the tuning loop).

In your case, it seems very off that some trials take so much time individually (100 seconds+). It’s also suspicious that they are the first 10 trials - I’m wondering if the object store communication (for with_parameters) takes a long time and if Kubernetes/KubeRay is to blame here.
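
One way to narrow this down (a sketch, not a verified fix: it reuses data, np, ray, and tune from the 2.2 script and passes the big array as an explicit ObjectRef, as the 1.13 script did) would be:

data_ref = ray.put(data)  # put the big array into the object store once, up front

def train(config, data_ref=None):
    # Each trial fetches the array from the object store itself, instead of
    # relying on tune.with_parameters to deliver the materialized array.
    data = ray.get(data_ref)
    a = np.random.rand(3000, 3000)
    b = np.random.rand(3000, 3000)
    result = np.dot(a, b)
    return {"b": config["a"]}

trainable = tune.with_resources(
    tune.with_parameters(train, data_ref=data_ref),
    tune.PlacementGroupFactory([{"CPU": 1}], strategy="PACK"),
)

If the per-trial times drop with this variant, the with_parameters data transfer is the likely culprit; if not, the slowdown is elsewhere.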

I see you don’t want to share the loc column, but can you indicate (for the first ~14 items), maybe with colors, which trials are running on the same node? If I had to guess, it looks like trials 0 and 5 may be running on the same node (possibly the node where the remote wrapper function is being executed).

For a bit more context, one of the main differences for the syncing is that in 1.13 we used ssh-based syncing whereas in Ray > 2 we use object-store based syncing. But since SSH is not available on Kubernetes, it actually defaults to not syncing at all. I’m a bit surprised that this does not decrease your runtime at all - what are the longest per-trial runtimes you’re seeing? Are any of the later trials taking a long time?

Edit:

I now also ran into straggling trials, and in my case all the stragglers ran on the same node, which was also the node where the script was executed:

(run pid=82732) +-------------------+------------+----------------------+-----+--------+------------------+-----+
(run pid=82732) | Trial name        | status     | loc                  |   a |   iter |   total time (s) |   b |
(run pid=82732) |-------------------+------------+----------------------+-----+--------+------------------+-----|
(run pid=82732) | train_f9607_00000 | TERMINATED | 172.31.252.64:22738  |   0 |      1 |         2.14306  |   0 |
(run pid=82732) | train_f9607_00001 | TERMINATED | 172.31.153.19:82906  |   1 |      1 |        47.8846   |   1 |
(run pid=82732) | train_f9607_00002 | TERMINATED | 172.31.148.122:24520 |   2 |      1 |         2.46509  |   2 |
(run pid=82732) | train_f9607_00003 | TERMINATED | 172.31.186.223:29668 |   3 |      1 |         2.80529  |   3 |
(run pid=82732) | train_f9607_00004 | TERMINATED | 172.31.153.19:82908  |   4 |      1 |        52.4088   |   4 |
(run pid=82732) | train_f9607_00005 | TERMINATED | 172.31.148.122:24521 |   5 |      1 |         1.23845  |   5 |
...

Ran both scripts (including the loc column this time):

1.13.0

(run pid=56948, ip=10.95.3.199) == Status ==
(run pid=56948, ip=10.95.3.199) Current time: 2023-01-31 14:27:02 (running for 00:00:52.82)
(run pid=56948, ip=10.95.3.199) Memory usage on this node: 6.9/31.4 GiB
(run pid=56948, ip=10.95.3.199) Using FIFO scheduling algorithm.
(run pid=56948, ip=10.95.3.199) Resources requested: 0/12 CPUs, 0/0 GPUs, 0.0/22.4 GiB heap, 0.0/9.34 GiB objects
(run pid=56948, ip=10.95.3.199) Result logdir: /home/ray/ray_results/train_2023-01-31_14-26-09
(run pid=56948, ip=10.95.3.199) Number of trials: 100/100 (100 TERMINATED)
(run pid=56948, ip=10.95.3.199) +-------------------+------------+-------------------+-----+--------+------------------+-----+
(run pid=56948, ip=10.95.3.199) | Trial name        | status     | loc               |   a |   iter |   total time (s) |   b |
(run pid=56948, ip=10.95.3.199) |-------------------+------------+-------------------+-----+--------+------------------+-----|
(run pid=56948, ip=10.95.3.199) | train_426ea_00000 | TERMINATED | 10.95.3.199:57011 |   0 |      1 |          1.76228 |   0 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00001 | TERMINATED | 10.95.3.199:57093 |   1 |      1 |          2.08336 |   1 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00002 | TERMINATED | 10.95.7.197:56485 |   2 |      1 |          4.56515 |   2 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00003 | TERMINATED | 10.95.7.197:56486 |   3 |      1 |          4.57111 |   3 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00004 | TERMINATED | 10.95.0.135:56884 |   4 |      1 |          4.95213 |   4 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00005 | TERMINATED | 10.95.0.135:56885 |   5 |      1 |          4.9252  |   5 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00006 | TERMINATED | 10.95.7.197:56487 |   6 |      1 |          4.58786 |   6 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00007 | TERMINATED | 10.95.0.135:56886 |   7 |      1 |          4.96597 |   7 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00008 | TERMINATED | 10.95.7.197:56488 |   8 |      1 |          4.46997 |   8 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00009 | TERMINATED | 10.95.0.135:56887 |   9 |      1 |          4.90358 |   9 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00010 | TERMINATED | 10.95.3.199:57132 |  10 |      1 |          1.68094 |  10 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00011 | TERMINATED | 10.95.3.199:57217 |  11 |      1 |          3.57757 |  11 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00012 | TERMINATED | 10.95.3.199:57247 |  12 |      1 |          4.88524 |  12 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00013 | TERMINATED | 10.95.7.197:56485 |  13 |      1 |          1.6866  |  13 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00014 | TERMINATED | 10.95.7.197:56486 |  14 |      1 |          1.73959 |  14 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00015 | TERMINATED | 10.95.7.197:56487 |  15 |      1 |          1.67977 |  15 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00016 | TERMINATED | 10.95.0.135:56887 |  16 |      1 |          1.8285  |  16 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00017 | TERMINATED | 10.95.0.135:56886 |  17 |      1 |          1.92963 |  17 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00018 | TERMINATED | 10.95.0.135:56885 |  18 |      1 |          1.85684 |  18 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00019 | TERMINATED | 10.95.0.135:56884 |  19 |      1 |          1.83398 |  19 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00020 | TERMINATED | 10.95.3.199:57516 |  20 |      1 |          1.71036 |  20 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00021 | TERMINATED | 10.95.7.197:56486 |  21 |      1 |          2.66463 |  21 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00022 | TERMINATED | 10.95.7.197:56487 |  22 |      1 |          2.85354 |  22 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00023 | TERMINATED | 10.95.7.197:56932 |  23 |      1 |          5.13414 |  23 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00024 | TERMINATED | 10.95.7.197:56487 |  24 |      1 |          4.94336 |  24 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00025 | TERMINATED | 10.95.3.199:57247 |  25 |      1 |          4.83179 |  25 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00026 | TERMINATED | 10.95.0.135:56887 |  26 |      1 |          1.7902  |  26 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00027 | TERMINATED | 10.95.0.135:56884 |  27 |      1 |          1.82473 |  27 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00028 | TERMINATED | 10.95.7.197:56486 |  28 |      1 |          4.54313 |  28 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00029 | TERMINATED | 10.95.3.199:57516 |  29 |      1 |          5.91661 |  29 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00030 | TERMINATED | 10.95.0.135:56885 |  30 |      1 |          1.82559 |  30 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00031 | TERMINATED | 10.95.0.135:56886 |  31 |      1 |          1.82649 |  31 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00032 | TERMINATED | 10.95.0.135:56887 |  32 |      1 |          1.93893 |  32 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00033 | TERMINATED | 10.95.0.135:56884 |  33 |      1 |          2.00254 |  33 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00034 | TERMINATED | 10.95.0.135:56885 |  34 |      1 |          2.05942 |  34 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00035 | TERMINATED | 10.95.0.135:56886 |  35 |      1 |          2.23067 |  35 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00036 | TERMINATED | 10.95.7.197:57213 |  36 |      1 |          2.0536  |  36 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00037 | TERMINATED | 10.95.0.135:56884 |  37 |      1 |          1.71291 |  37 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00038 | TERMINATED | 10.95.0.135:56885 |  38 |      1 |          1.70811 |  38 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00039 | TERMINATED | 10.95.0.135:56886 |  39 |      1 |          1.71251 |  39 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00040 | TERMINATED | 10.95.7.197:56486 |  40 |      1 |          3.83534 |  40 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00041 | TERMINATED | 10.95.3.199:57247 |  41 |      1 |          6.06563 |  41 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00042 | TERMINATED | 10.95.7.197:56487 |  42 |      1 |          3.8178  |  42 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00043 | TERMINATED | 10.95.7.197:56932 |  43 |      1 |          3.62991 |  43 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00044 | TERMINATED | 10.95.0.135:56884 |  44 |      1 |          1.72628 |  44 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00045 | TERMINATED | 10.95.3.199:57516 |  45 |      1 |          5.80795 |  45 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00046 | TERMINATED | 10.95.0.135:56885 |  46 |      1 |          1.70903 |  46 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00047 | TERMINATED | 10.95.0.135:56886 |  47 |      1 |          1.65775 |  47 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00048 | TERMINATED | 10.95.0.135:56884 |  48 |      1 |          1.66402 |  48 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00049 | TERMINATED | 10.95.7.197:56487 |  49 |      1 |          2.03086 |  49 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00050 | TERMINATED | 10.95.7.197:56486 |  50 |      1 |          2.06001 |  50 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00051 | TERMINATED | 10.95.3.199:57247 |  51 |      1 |          6.66656 |  51 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00052 | TERMINATED | 10.95.7.197:56932 |  52 |      1 |          2.17114 |  52 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00053 | TERMINATED | 10.95.0.135:56886 |  53 |      1 |          1.66996 |  53 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00054 | TERMINATED | 10.95.0.135:56885 |  54 |      1 |          1.66362 |  54 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00055 | TERMINATED | 10.95.3.199:57516 |  55 |      1 |          6.61001 |  55 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00056 | TERMINATED | 10.95.7.197:57213 |  56 |      1 |          2.12672 |  56 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00057 | TERMINATED | 10.95.0.135:56884 |  57 |      1 |          1.683   |  57 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00058 | TERMINATED | 10.95.7.197:56487 |  58 |      1 |          2.0318  |  58 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00059 | TERMINATED | 10.95.0.135:56886 |  59 |      1 |          1.68986 |  59 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00060 | TERMINATED | 10.95.7.197:56486 |  60 |      1 |          2.07001 |  60 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00061 | TERMINATED | 10.95.0.135:56885 |  61 |      1 |          1.71859 |  61 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00062 | TERMINATED | 10.95.7.197:56932 |  62 |      1 |          2.18271 |  62 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00063 | TERMINATED | 10.95.0.135:56884 |  63 |      1 |          1.71984 |  63 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00064 | TERMINATED | 10.95.7.197:57213 |  64 |      1 |          2.17055 |  64 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00065 | TERMINATED | 10.95.0.135:56886 |  65 |      1 |          1.70144 |  65 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00066 | TERMINATED | 10.95.7.197:56487 |  66 |      1 |          2.5934  |  66 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00067 | TERMINATED | 10.95.0.135:56885 |  67 |      1 |          1.70496 |  67 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00068 | TERMINATED | 10.95.7.197:56486 |  68 |      1 |          3.7502  |  68 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00069 | TERMINATED | 10.95.7.197:56932 |  69 |      1 |          3.27883 |  69 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00070 | TERMINATED | 10.95.0.135:56884 |  70 |      1 |          1.66711 |  70 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00071 | TERMINATED | 10.95.0.135:56886 |  71 |      1 |          1.65352 |  71 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00072 | TERMINATED | 10.95.7.197:57213 |  72 |      1 |          3.24296 |  72 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00073 | TERMINATED | 10.95.0.135:56885 |  73 |      1 |          1.69178 |  73 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00074 | TERMINATED | 10.95.3.199:57247 |  74 |      1 |          6.19315 |  74 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00075 | TERMINATED | 10.95.7.197:56487 |  75 |      1 |          3.6209  |  75 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00076 | TERMINATED | 10.95.0.135:56884 |  76 |      1 |          1.66786 |  76 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00077 | TERMINATED | 10.95.0.135:56886 |  77 |      1 |          1.68222 |  77 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00078 | TERMINATED | 10.95.3.199:57516 |  78 |      1 |          6.6275  |  78 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00079 | TERMINATED | 10.95.0.135:56885 |  79 |      1 |          1.70662 |  79 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00080 | TERMINATED | 10.95.7.197:56932 |  80 |      1 |          2.58807 |  80 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00081 | TERMINATED | 10.95.7.197:56486 |  81 |      1 |          2.76165 |  81 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00082 | TERMINATED | 10.95.0.135:56884 |  82 |      1 |          1.72775 |  82 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00083 | TERMINATED | 10.95.0.135:56886 |  83 |      1 |          1.67081 |  83 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00084 | TERMINATED | 10.95.7.197:57213 |  84 |      1 |          2.3103  |  84 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00085 | TERMINATED | 10.95.0.135:56885 |  85 |      1 |          1.70849 |  85 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00086 | TERMINATED | 10.95.0.135:56884 |  86 |      1 |          1.68958 |  86 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00087 | TERMINATED | 10.95.7.197:56487 |  87 |      1 |          2.15398 |  87 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00088 | TERMINATED | 10.95.7.197:56932 |  88 |      1 |          2.37201 |  88 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00089 | TERMINATED | 10.95.0.135:56886 |  89 |      1 |          1.71201 |  89 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00090 | TERMINATED | 10.95.7.197:56486 |  90 |      1 |          2.07772 |  90 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00091 | TERMINATED | 10.95.0.135:56885 |  91 |      1 |          1.75702 |  91 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00092 | TERMINATED | 10.95.7.197:57213 |  92 |      1 |          2.18066 |  92 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00093 | TERMINATED | 10.95.0.135:56884 |  93 |      1 |          1.68484 |  93 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00094 | TERMINATED | 10.95.0.135:56886 |  94 |      1 |          1.69372 |  94 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00095 | TERMINATED | 10.95.7.197:56487 |  95 |      1 |          1.87231 |  95 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00096 | TERMINATED | 10.95.3.199:57247 |  96 |      1 |          5.12555 |  96 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00097 | TERMINATED | 10.95.7.197:56486 |  97 |      1 |          1.84924 |  97 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00098 | TERMINATED | 10.95.0.135:56885 |  98 |      1 |          1.71463 |  98 |
(run pid=56948, ip=10.95.3.199) | train_426ea_00099 | TERMINATED | 10.95.7.197:56932 |  99 |      1 |          1.98175 |  99 |
(run pid=56948, ip=10.95.3.199) +-------------------+------------+-------------------+-----+--------+------------------+-----+

2.2.0

(with the RunConfig + SyncConfig setting)

(run pid=4569, ip=10.137.2.135) == Status ==
(run pid=4569, ip=10.137.2.135) Current time: 2023-01-31 14:21:37 (running for 00:05:20.95)
(run pid=4569, ip=10.137.2.135) Memory usage on this node: 5.6/31.4 GiB
(run pid=4569, ip=10.137.2.135) Using FIFO scheduling algorithm.
(run pid=4569, ip=10.137.2.135) Resources requested: 0/12 CPUs, 0/0 GPUs, 0.0/32.0 GiB heap, 0.0/9.48 GiB objects
(run pid=4569, ip=10.137.2.135) Result logdir: /home/ray/ray_results/train_2023-01-31_14-16-16
(run pid=4569, ip=10.137.2.135) Number of trials: 100/100 (100 TERMINATED)
(run pid=4569, ip=10.137.2.135) +-------------------+------------+-------------------+-----+--------+------------------+-----+
(run pid=4569, ip=10.137.2.135) | Trial name        | status     | loc               |   a |   iter |   total time (s) |   b |
(run pid=4569, ip=10.137.2.135) |-------------------+------------+-------------------+-----+--------+------------------+-----|
(run pid=4569, ip=10.137.2.135) | train_e10a0_00000 | TERMINATED | 10.137.2.135:4730 |   0 |      1 |         2.8155   |   0 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00001 | TERMINATED | 10.137.2.135:4948 |   1 |      1 |        42.0916   |   1 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00002 | TERMINATED | 10.137.2.2:1196   |   2 |      1 |        56.5312   |   2 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00003 | TERMINATED | 10.137.2.2:1197   |   3 |      1 |        53.9542   |   3 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00004 | TERMINATED | 10.137.0.200:8042 |   4 |      1 |        53.8982   |   4 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00005 | TERMINATED | 10.137.0.200:8043 |   5 |      1 |        53.6154   |   5 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00006 | TERMINATED | 10.137.2.135:4954 |   6 |      1 |        31.3656   |   6 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00007 | TERMINATED | 10.137.2.2:1198   |   7 |      1 |        53.7435   |   7 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00008 | TERMINATED | 10.137.0.200:8044 |   8 |      1 |        53.4172   |   8 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00009 | TERMINATED | 10.137.2.135:4957 |   9 |      1 |        42.0906   |   9 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00010 | TERMINATED | 10.137.0.200:8264 |  10 |      1 |         1.86836  |  10 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00011 | TERMINATED | 10.137.2.135:4948 |  11 |      1 |         4.21502  |  11 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00012 | TERMINATED | 10.137.0.200:8044 |  12 |      1 |        36.0631   |  12 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00013 | TERMINATED | 10.137.0.200:8042 |  13 |      1 |        36.3407   |  13 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00014 | TERMINATED | 10.137.0.200:8043 |  14 |      1 |        36.0622   |  14 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00015 | TERMINATED | 10.137.2.135:4957 |  15 |      1 |         7.11884  |  15 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00016 | TERMINATED | 10.137.2.2:1395   |  16 |      1 |        40.805    |  16 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00017 | TERMINATED | 10.137.2.2:1197   |  17 |      1 |         1.8442   |  17 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00018 | TERMINATED | 10.137.2.2:1196   |  18 |      1 |         2.0101   |  18 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00019 | TERMINATED | 10.137.2.135:5208 |  19 |      1 |        45.8674   |  19 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00020 | TERMINATED | 10.137.2.135:4957 |  20 |      1 |         4.00654  |  20 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00021 | TERMINATED | 10.137.2.2:1196   |  21 |      1 |        47.7346   |  21 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00022 | TERMINATED | 10.137.2.135:4948 |  22 |      1 |         2.98643  |  22 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00023 | TERMINATED | 10.137.2.2:1197   |  23 |      1 |        49.2904   |  23 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00024 | TERMINATED | 10.137.2.135:4948 |  24 |      1 |        49.9281   |  24 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00025 | TERMINATED | 10.137.2.135:4957 |  25 |      1 |        49.8534   |  25 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00026 | TERMINATED | 10.137.0.200:8043 |  26 |      1 |         0.82917  |  26 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00027 | TERMINATED | 10.137.0.200:8042 |  27 |      1 |         0.865061 |  27 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00028 | TERMINATED | 10.137.0.200:8044 |  28 |      1 |         0.885888 |  28 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00029 | TERMINATED | 10.137.0.200:8043 |  29 |      1 |         0.817324 |  29 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00030 | TERMINATED | 10.137.2.2:1754   |  30 |      1 |         1.23039  |  30 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00031 | TERMINATED | 10.137.0.200:8347 |  31 |      1 |         1.39202  |  31 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00032 | TERMINATED | 10.137.0.200:8044 |  32 |      1 |         0.910604 |  32 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00033 | TERMINATED | 10.137.2.2:1847   |  33 |      1 |         1.79787  |  33 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00034 | TERMINATED | 10.137.2.2:1196   |  34 |      1 |        19.5816   |  34 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00035 | TERMINATED | 10.137.2.135:5208 |  35 |      1 |        27.8341   |  35 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00036 | TERMINATED | 10.137.2.2:1754   |  36 |      1 |        19.142    |  36 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00037 | TERMINATED | 10.137.0.200:8347 |  37 |      1 |        48.7963   |  37 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00038 | TERMINATED | 10.137.2.135:4948 |  38 |      1 |        38.4528   |  38 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00039 | TERMINATED | 10.137.0.200:8042 |  39 |      1 |        51.3409   |  39 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00040 | TERMINATED | 10.137.2.135:4957 |  40 |      1 |        38.4014   |  40 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00041 | TERMINATED | 10.137.0.200:8044 |  41 |      1 |        51.5806   |  41 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00042 | TERMINATED | 10.137.2.2:1911   |  42 |      1 |         1.34126  |  42 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00043 | TERMINATED | 10.137.2.2:1754   |  43 |      1 |         1.32023  |  43 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00044 | TERMINATED | 10.137.2.135:5208 |  44 |      1 |         9.00817  |  44 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00045 | TERMINATED | 10.137.2.2:1911   |  45 |      1 |         1.33171  |  45 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00046 | TERMINATED | 10.137.2.2:1847   |  46 |      1 |         1.58881  |  46 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00047 | TERMINATED | 10.137.2.2:1847   |  47 |      1 |         1.22493  |  47 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00048 | TERMINATED | 10.137.2.135:4957 |  48 |      1 |        52.2693   |  48 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00049 | TERMINATED | 10.137.2.2:1754   |  49 |      1 |         1.61392  |  49 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00050 | TERMINATED | 10.137.2.135:4948 |  50 |      1 |        51.5663   |  50 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00051 | TERMINATED | 10.137.2.2:1911   |  51 |      1 |        19.3781   |  51 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00052 | TERMINATED | 10.137.2.135:5208 |  52 |      1 |        48.1152   |  52 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00053 | TERMINATED | 10.137.2.2:1847   |  53 |      1 |        22.2116   |  53 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00054 | TERMINATED | 10.137.2.2:1754   |  54 |      1 |        22.7817   |  54 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00055 | TERMINATED | 10.137.0.200:8577 |  55 |      1 |         0.992126 |  55 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00056 | TERMINATED | 10.137.0.200:8042 |  56 |      1 |         1.03513  |  56 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00057 | TERMINATED | 10.137.2.2:2074   |  57 |      1 |         1.24456  |  57 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00058 | TERMINATED | 10.137.2.2:1754   |  58 |      1 |         1.22544  |  58 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00059 | TERMINATED | 10.137.2.2:1911   |  59 |      1 |         1.13638  |  59 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00060 | TERMINATED | 10.137.2.2:2074   |  60 |      1 |         1.13766  |  60 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00061 | TERMINATED | 10.137.0.200:8577 |  61 |      1 |         1.17215  |  61 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00062 | TERMINATED | 10.137.0.200:8842 |  62 |      1 |         0.87288  |  62 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00063 | TERMINATED | 10.137.2.2:1754   |  63 |      1 |         1.14188  |  63 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00064 | TERMINATED | 10.137.2.135:4957 |  64 |      1 |        38.8441   |  64 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00065 | TERMINATED | 10.137.0.200:8577 |  65 |      1 |        24.2404   |  65 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00066 | TERMINATED | 10.137.0.200:8842 |  66 |      1 |        25.8941   |  66 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00067 | TERMINATED | 10.137.2.135:5208 |  67 |      1 |        39.1757   |  67 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00068 | TERMINATED | 10.137.2.2:2074   |  68 |      1 |        27.4164   |  68 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00069 | TERMINATED | 10.137.2.2:1911   |  69 |      1 |        27.991    |  69 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00070 | TERMINATED | 10.137.0.200:8042 |  70 |      1 |        25.6123   |  70 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00071 | TERMINATED | 10.137.2.2:1754   |  71 |      1 |        28.3527   |  71 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00072 | TERMINATED | 10.137.2.135:4948 |  72 |      1 |        33.2123   |  72 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00073 | TERMINATED | 10.137.0.200:8577 |  73 |      1 |         0.825096 |  73 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00074 | TERMINATED | 10.137.2.2:2529   |  74 |      1 |         1.66366  |  74 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00075 | TERMINATED | 10.137.0.200:8042 |  75 |      1 |         2.12771  |  75 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00076 | TERMINATED | 10.137.2.2:1754   |  76 |      1 |         1.15653  |  76 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00077 | TERMINATED | 10.137.0.200:8957 |  77 |      1 |         1.26463  |  77 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00078 | TERMINATED | 10.137.0.200:9033 |  78 |      1 |         1.45506  |  78 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00079 | TERMINATED | 10.137.2.2:1754   |  79 |      1 |         2.21953  |  79 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00080 | TERMINATED | 10.137.0.200:8042 |  80 |      1 |         2.93848  |  80 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00081 | TERMINATED | 10.137.2.2:1911   |  81 |      1 |        57.5244   |  81 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00082 | TERMINATED | 10.137.2.2:2529   |  82 |      1 |        15.921    |  82 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00083 | TERMINATED | 10.137.2.135:4948 |  83 |      1 |        34.4612   |  83 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00084 | TERMINATED | 10.137.2.135:4957 |  84 |      1 |        36.0164   |  84 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00085 | TERMINATED | 10.137.2.2:2074   |  85 |      1 |        56.9236   |  85 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00086 | TERMINATED | 10.137.2.135:5208 |  86 |      1 |        36.0996   |  86 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00087 | TERMINATED | 10.137.0.200:9095 |  87 |      1 |         0.897899 |  87 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00088 | TERMINATED | 10.137.0.200:8042 |  88 |      1 |         0.814859 |  88 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00089 | TERMINATED | 10.137.0.200:9252 |  89 |      1 |         1.00068  |  89 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00090 | TERMINATED | 10.137.0.200:9033 |  90 |      1 |         0.851022 |  90 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00091 | TERMINATED | 10.137.0.200:8042 |  91 |      1 |         0.804057 |  91 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00092 | TERMINATED | 10.137.0.200:9252 |  92 |      1 |         0.894551 |  92 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00093 | TERMINATED | 10.137.0.200:8042 |  93 |      1 |         0.802148 |  93 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00094 | TERMINATED | 10.137.0.200:9252 |  94 |      1 |         0.869299 |  94 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00095 | TERMINATED | 10.137.2.2:2529   |  95 |      1 |        28.1904   |  95 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00096 | TERMINATED | 10.137.0.200:9252 |  96 |      1 |         2.14605  |  96 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00097 | TERMINATED | 10.137.0.200:9033 |  97 |      1 |         2.84002  |  97 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00098 | TERMINATED | 10.137.2.135:5208 |  98 |      1 |        26.2755   |  98 |
(run pid=4569, ip=10.137.2.135) | train_e10a0_00099 | TERMINATED | 10.137.2.135:4948 |  99 |      1 |        25.8063   |  99 |
(run pid=4569, ip=10.137.2.135) +-------------------+------------+-------------------+-----+--------+------------------+-----+

The slow trials do not appear to be limited to a single node. I confirm that nothing is running on the head node in either scenario.

Thanks! It looks like we have two separate issues here:

  1. The syncing - at least that’s what gives me the regression in my reproduction. Disabling it doesn’t help in your case because it’s masked by the second issue.

  2. Individual trials take a long time with Ray 2.2. With max_concurrent=10, the 100 trials run in roughly 10 blocks of 10 trials; if most trials in a block run for > 20 seconds, this easily adds up to 200+ seconds.

For the second case, I’m wondering what the actual issue here is.

Can you try running

export OMP_NUM_THREADS=4

and passing

    ray.init("ray://127.0.0.1:10001", runtime_env={"env_vars": {"OMP_NUM_THREADS": "4"}}))

to see if this makes a difference? I’m also happy to jump on a call to debug if you’d like.

I’ll also try to reproduce this on our own Kubernetes setup, but I’ll mitigate the syncing issue first.

Whether or not OMP_NUM_THREADS is set has a large impact. I tested different values, and OMP_NUM_THREADS=1 gives good results.

Ray Version   OMP_NUM_THREADS   Total time (s)   Max trial time (s)   # trials > 10 s
Ray 2.2.0     [Not Set]         349.55           70.1016              43
Ray 2.2.0     [Not Set]         237.92           69.2617              40
Ray 2.2.0     [Not Set]         200.94           46.1469              28
Ray 2.2.0     4                 125.72           45.2951              17
Ray 2.2.0     4                 110.99           39.7515              16
Ray 2.2.0     4                 105.85           17.5634              10
Ray 2.2.0     3                  58.15            7.94047              0
Ray 2.2.0     3                  62.78            7.47301              0
Ray 2.2.0     3                  53.92            7.1088               0
Ray 2.2.0     2                  54.4             4.47918              0
Ray 2.2.0     2                  54.47            4.28362              0
Ray 2.2.0     2                  54.84            4.74331              0
Ray 2.2.0     1                  52.26            6.39861              0
Ray 2.2.0     1                  54.21            7.23307              0
Ray 2.2.0     1                  48.77            7.90307              0
Ray 1.13.0    [Not Set]          71.92            6.48492              0
Ray 1.13.0    1                  77.68            7.49564              0

The documentation says that Ray sets OMP_NUM_THREADS=<num_cpus> by default. I am not sure why there is such a big difference between not setting OMP_NUM_THREADS and setting OMP_NUM_THREADS=4, since we should get 4 by default (we are running workers with 4 CPUs).
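
One way to check what a trial process actually sees (a sketch, reusing the train function from the 2.2 script and simply adding the value to the reported result):

import os

def train(config, data):
    # Report the OMP thread setting this trial process runs with, so it can be
    # compared against the documented default and the runtime_env override.
    a = np.random.rand(3000, 3000)
    b = np.random.rand(3000, 3000)
    result = np.dot(a, b)
    return {
        "b": config["a"],
        "omp_num_threads": os.environ.get("OMP_NUM_THREADS", "<not set>"),
    }

The extra key is reported with each trial result, so it can be cross-checked per node.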

I am getting the best minimum trial time with OMP_NUM_THREADS=4, which probably makes sense as np.dot would leverage the 4 CPUs (via BLAS). However, it creates contention and delivers worse performance overall.

Finally, I am wondering about the behavior with 1.13.0: it seems slower overall, and setting OMP_NUM_THREADS=1 seems to make it worse.

EDIT: All the 2.2.0 tests above were run with syncer=None.