Ray Tune: `max_concurrent_trials` does not run trials concurrently

zmin1217 · January 4, 2024, 11:48am

How severe does this issue affect your experience of using Ray?

High: It blocks me to complete my task.

Description:
I am running multiple concurrent trials with Ray Tune (Ray 2.6.1). The server has sufficient resources, but when the calling code and the executing code are not in the same file, max_concurrent_trials has no effect. Perhaps that was not very clear.
In simple terms, python a.py works, but python b.py does not.

a.py:

import os
import tempfile
import time
import unittest

import ray
from catboost import CatBoostClassifier
from ray import tune, train, air
from ray.air import RunConfig, session, Checkpoint, CheckpointConfig
from hyperopt import hp
from ray.tune import ExperimentAnalysis
from ray.tune.search import ConcurrencyLimiter
from ray.tune.search.hyperopt import HyperOptSearch
from sklearn.metrics import f1_score, recall_score, precision_score

from automl.automl.modeling.hpo.ray.callback import LogInfoCallback
from automl.automl.modeling.hpo.ray.reporter import LogInfoReporter


class RayHPO:

    def train(self) -> None:

        os.environ["RAY_AIR_NEW_OUTPUT"] = "0"
        space = {
            "verbose": hp.choice("verbose", [False]),
            "learning_rate": hp.uniform("learning_rate", 5e-3, 0.2),
            "depth": hp.randint("depth", 5, 8),
        }
        ray.init(num_cpus=3, include_dashboard=True, logging_level='error')

        hyperopt_search = HyperOptSearch(space, metric="f1", mode="max")
        reporter = LogInfoReporter(infer_limit=5, max_report_frequency=15)
        callbacks = [LogInfoCallback(metric="f1")]
        tuner = tune.Tuner(
            trainable_demo,
            tune_config=tune.TuneConfig(
                num_samples=20,
                search_alg=hyperopt_search,
                metric="f1",
                mode="max",
                max_concurrent_trials=4,
            ),
            run_config=air.RunConfig(storage_path="/mnt/disk1/tmp/ray_results", name="con",
                                     callbacks=callbacks,
                                     progress_reporter=reporter, verbose=2)
        )
        tuner.fit()

def trainable_demo(config):
    time.sleep(3)
    session.report({"f1": 0.8, "auc": 0.8})

if __name__ == '__main__':
    hpo = RayHPO()
    hpo.train()

From the picture, it can be seen that two trials are executed every 3 seconds.

b.py:

from a import RayHPO
if __name__ == '__main__':
    hpo = RayHPO()
    hpo.train()

From the picture, it can be seen that one trial is executed every 3 seconds. Trials are executed sequentially.

zmin1217 · January 5, 2024, 2:20am

When trainable_demo sleeps for a longer period (for example, 60 seconds), it works well. Does a trainable_demo run time that is too short affect scheduling?

The picture below shows trainable_demo sleeping for 3 seconds: only one actor is ever alive, with two pending.

justinvyu · January 18, 2024, 12:03am

@zmin1217 Do the dead actors have any error logs if you click into Log? If you’re doing this relative import from a.py, you may need to attach the working directory to the Ray runtime environment (see Environment Dependencies in the Ray 2.9.0 docs).
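
For reference, attaching the working directory might look roughly like the sketch below. The `runtime_env={"working_dir": ...}` option is accepted by `ray.init`; the helper name `build_runtime_env` and the choice of the current directory are assumptions made for illustration and do not come from this thread.

```python
import os


def build_runtime_env(project_root: str) -> dict:
    """Return a runtime_env dict that ships project_root to Ray workers.

    `build_runtime_env` is a hypothetical helper, not part of Ray.
    """
    return {"working_dir": os.path.abspath(project_root)}


# Usage (requires Ray to be installed):
#   import ray
#   ray.init(num_cpus=3, runtime_env=build_runtime_env("."))
```

With this in place, modules that live next to the driver script (such as a.py) should be importable inside the worker processes as well.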

zmin1217 · January 18, 2024, 5:54am

Thanks for your response. There is no error info in the dead actor’s stderr log, and I am not using a relative import from a.py.

the dead actor’s err file:

the dead actor’s python-core-worker-**.log:

justinvyu · January 22, 2024, 7:07pm

@zmin1217 I notice that you have a custom search algorithm. In this case, Tune limits the number of pending trials to 1 so that you have some results to fit the searcher before suggesting new trial hyperparams. Could you try setting the environment variable TUNE_MAX_PENDING_TRIALS_PG=4?
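
A minimal sketch of setting that variable from Python rather than from the shell, assuming b.py is the entry point. The function name `run_hpo` is hypothetical, and placing the assignment before any Ray import is a conservative assumption, not a documented requirement.

```python
import os

# Set the variable in the driver process before Tune starts scheduling trials.
# Doing this before importing ray is the cautious choice (an assumption here).
os.environ["TUNE_MAX_PENDING_TRIALS_PG"] = "4"


def run_hpo() -> None:
    """Hypothetical entry point: run the tuning job defined in a.py."""
    from a import RayHPO  # a.py as posted earlier in this thread

    RayHPO().train()


# Usage: call run_hpo() from b.py's `if __name__ == "__main__":` block.
```

The shell equivalent would be `TUNE_MAX_PENDING_TRIALS_PG=4 python b.py`.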

zmin1217 · January 23, 2024, 2:37am

Thanks for reaching out again. Setting the environment variable TUNE_MAX_PENDING_TRIALS_PG=4 does indeed help. But I’m a bit curious why trials can run concurrently without setting the environment variable when trainable_demo runs for a longer time, as shown in the picture above.

justinvyu · February 29, 2024, 1:23am

This may be a result of TuneConfig(reuse_actors=True). Could you try setting that to False? In that case, an actor will be spawned for each trial, rather than Tune trying to share actors across multiple trials.
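
As a rough sketch, the change could look like the following. The keyword values mirror the TuneConfig in a.py; the names `TUNE_CONFIG_KWARGS` and `make_tune_config` are hypothetical and only used for illustration.

```python
# Keyword arguments mirroring the TuneConfig in a.py, with actor reuse disabled.
TUNE_CONFIG_KWARGS = {
    "num_samples": 20,
    "metric": "f1",
    "mode": "max",
    "max_concurrent_trials": 4,
    "reuse_actors": False,  # spawn a fresh actor for every trial
}


def make_tune_config(search_alg=None):
    """Build a TuneConfig from the kwargs above (hypothetical helper)."""
    from ray import tune  # imported lazily so the module loads without Ray

    return tune.TuneConfig(search_alg=search_alg, **TUNE_CONFIG_KWARGS)


# Usage (requires Ray):
#   tuner = tune.Tuner(trainable_demo,
#                      tune_config=make_tune_config(hyperopt_search))
```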

zmin1217 · February 29, 2024, 2:49am

@justinvyu Thank you very much for your long-term help. Without setting TUNE_MAX_PENDING_TRIALS_PG=4, and with reuse_actors=False, the results are still the same as before.
But why, with reuse_actors=True, are new actors always being spawned, as shown in the picture below?
