Ray Tune: `max_concurrent_trials` does not run trials concurrently

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

Description:
I am running multiple concurrent trials with Ray Tune (Ray 2.6.1). The server has sufficient resources, but when the script that calls the tuning code is not the file that defines it, max_concurrent_trials has no effect.
In simple terms: python a.py runs trials concurrently, but python b.py does not.

a.py:

import os
import tempfile
import time
import unittest

import ray
from catboost import CatBoostClassifier
from ray import tune, train, air
from ray.air import RunConfig, session, Checkpoint, CheckpointConfig
from hyperopt import hp
from ray.tune import ExperimentAnalysis
from ray.tune.search import ConcurrencyLimiter
from ray.tune.search.hyperopt import HyperOptSearch
from sklearn.metrics import f1_score, recall_score, precision_score

from automl.automl.modeling.hpo.ray.callback import LogInfoCallback
from automl.automl.modeling.hpo.ray.reporter import LogInfoReporter


class RayHPO:

    def train(self) -> None:

        os.environ["RAY_AIR_NEW_OUTPUT"] = "0"
        space = {
            "verbose": hp.choice("verbose", [False]),
            "learning_rate": hp.uniform("learning_rate", 5e-3, 0.2),
            "depth": hp.randint("depth", 5, 8),
        }
        ray.init(num_cpus=3, include_dashboard=True, logging_level='error')

        hyperopt_search = HyperOptSearch(space, metric="f1", mode="max")
        reporter = LogInfoReporter(infer_limit=5, max_report_frequency=15)
        callbacks = [LogInfoCallback(metric="f1")]
        tuner = tune.Tuner(
            trainable_demo,
            tune_config=tune.TuneConfig(
                num_samples=20,
                search_alg=hyperopt_search,
                metric="f1",
                mode="max",
                max_concurrent_trials=4,
            ),
            run_config=air.RunConfig(storage_path="/mnt/disk1/tmp/ray_results", name="con",
                                     callbacks=callbacks,
                                     progress_reporter=reporter, verbose=2)
        )
        tuner.fit()

def trainable_demo(config):
    time.sleep(3)
    session.report({"f1": 0.8, "auc": 0.8})

if __name__ == '__main__':
    hpo = RayHPO()
    hpo.train()

From the picture, it can be seen that two trials are executed every 3 seconds.

b.py:

from a import RayHPO
if __name__ == '__main__':
    hpo = RayHPO()
    hpo.train()


From the picture, it can be seen that only one trial is executed every 3 seconds. The trials run sequentially.

When I make trainable_demo sleep for longer (for example, 60 seconds), the trials run concurrently as expected. Does a very short trainable_demo run time affect scheduling?

The picture below shows the run where trainable_demo sleeps for 3 seconds: there is always only one actor alive, with two pending.

@zmin1217 Do the dead actors show any error logs if you click into Log? If you’re doing a relative import from a.py, you may need to attach the working directory to the Ray runtime environment; see "Environment Dependencies" in the Ray 2.9.0 documentation.
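For reference, attaching the working directory might look roughly like the sketch below. It assumes the driver is started from the directory that contains a.py and b.py; the choice of os.getcwd() is an illustration, not something taken from the original script, and the ray.init call is left as a comment so the snippet runs without a cluster.

```python
import os

# Assumption: the driver is launched from the directory holding a.py and b.py,
# so the current working directory is what the workers need to see.
runtime_env = {"working_dir": os.getcwd()}

# With Ray installed, the dict would be passed when starting Ray, e.g.:
#   ray.init(num_cpus=3, runtime_env=runtime_env)
print(runtime_env)
```

Ray uploads the contents of working_dir to the workers, so it is usually worth pointing it at a small directory rather than one that also holds results or datasets.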

Thanks for your response. There is no error information in the dead actor’s stderr log, and I am not using a relative import from a.py.

the dead actor’s err file:

the dead actor’s python-core-worker-**.log:

@zmin1217 I notice that you have a custom search algorithm. In this case, Tune limits the number of pending trials to 1 so that you have some results to fit the searcher before suggesting new trial hyperparams. Could you try setting the environment variable TUNE_MAX_PENDING_TRIALS_PG=4?
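One way to try this without changing the shell environment is to set the variable at the top of the driver script. This is a minimal sketch: it assumes the variable is read in the driver process when the experiment starts, so it is set before anything from ray.tune is imported. The helper pending_trial_limit is hypothetical and only reads the value back.

```python
import os

# Assumption: Tune reads this variable in the driver process, so set it
# before importing ray.tune or creating the Tuner.
os.environ["TUNE_MAX_PENDING_TRIALS_PG"] = "4"


def pending_trial_limit():
    """Hypothetical helper: return the configured limit, or None if unset."""
    value = os.environ.get("TUNE_MAX_PENDING_TRIALS_PG")
    return int(value) if value is not None else None


print(pending_trial_limit())
```

The same effect can be had from the shell by prefixing the command, for example TUNE_MAX_PENDING_TRIALS_PG=4 python b.py.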


Thanks for reaching out again. Setting the environment variable TUNE_MAX_PENDING_TRIALS_PG=4 does indeed help. But I’m a bit curious why trials run concurrently without the environment variable when trainable_demo runs for a longer time, as shown in the picture above.