Tuner.fit() never terminates

Hi all. I have a perplexing problem: with num_samples=1 in the Ray TuneConfig, the HPO runs as expected and terminates after 1 trial. But with num_samples=x, where x > 1, the HPO runs indefinitely: it behaves as expected for the first x trials, then keeps launching additional trials that reuse the first trial’s param values. And this only happens when I try to set per-trial resources (CPUs/GPUs). Any ideas?

  • I’m not running on a cluster
  • I have ray 2.0.0 installed
  • seems to only occur when trying to use a GPU

Example: the runs in the red box are the ones that are supposed to run; all other trials should not run, yet they get run with the first trial’s param values:

In case it’s useful…

  • I’m on Amazon Linux
  • my training function uses the “function API” (roughly sketched below):
    • starts 3 nested MLflow runs to train 3 models on different validation splits
    • averages the metric across folds and calls session.report
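
The real objective is too tangled to paste in full, but structurally it looks roughly like this (train_one_fold and the "val_loss" metric name are placeholders, not my actual code):

import mlflow
import numpy as np
from ray.air import session


def objective(config):
    mlflow.set_tracking_uri(config["mlflow"]["tracking_uri"])
    mlflow.set_experiment(config["mlflow"]["experiment_name"])

    fold_metrics = []
    with mlflow.start_run():  # parent MLflow run for this trial
        for fold in range(3):
            with mlflow.start_run(nested=True):  # one nested run per validation split
                metric = train_one_fold(config, fold)  # placeholder training routine
                mlflow.log_metric("val_loss", metric)
                fold_metrics.append(metric)

    # Report the cross-fold average back to Tune, once per trial.
    session.report({"val_loss": float(np.mean(fold_metrics))})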

My Tuner setup looks like this:

PARAM_SPACE = {
    "dropout": tune.uniform(0.1, 0.5),
    # <etc.> (remaining hyperparameters omitted)
    "mlflow": {
        # passed through to the objective so it can log to the right MLflow experiment
        "experiment_name": experiment_name,
        "tracking_uri": MLFLOW_TRACKING_URI,
    },
}


TUNE_CONFIG = tune.TuneConfig(
    metric=metric,
    mode="min",
    num_samples=n_trials,
    search_alg=OptunaSearch(),
    scheduler=MedianStoppingRule(),
)

tuner = tune.Tuner(
    tune.with_resources(objective, resources={"cpu": n_cpus_trial, "gpu": n_gpus_trial}),
    tune_config=TUNE_CONFIG,
    run_config=air.RunConfig(),
    param_space=PARAM_SPACE,
)

tuner.fit()
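
In case it’s relevant, here is a quick way to double-check what resources Ray detects on this machine (just a generic snippet for diagnosis, not part of my actual script):

import ray

ray.init(ignore_reinit_error=True)
print(ray.cluster_resources())    # total CPUs/GPUs Ray detected on this machine
print(ray.available_resources())  # what is currently unclaimed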

Can’t reproduce with this script:

import ray
from ray import tune, air
from ray.tune.search.optuna import OptunaSearch
from ray.tune.schedulers import MedianStoppingRule

PARAM_SPACE = {
    "dropout": tune.uniform(0.1, 0.5),
}


TUNE_CONFIG = tune.TuneConfig(
    metric="_metric",
    mode="min",
    num_samples=2,
    search_alg=OptunaSearch(),
    scheduler=MedianStoppingRule(),
)


def objective(config):
    # Returning a bare value from a function trainable reports it as "_metric".
    return 4


ray.init(num_cpus=4, num_gpus=1)  # local "cluster" with 4 CPUs and 1 GPU

tuner = tune.Tuner(
    tune.with_resources(objective, resources={"cpu": 1, "gpu": 1}),
    tune_config=TUNE_CONFIG,
    run_config=air.RunConfig(),
    param_space=PARAM_SPACE,
)

tuner.fit()

Can you try this out and let me know if the problem still comes up? If not, can you adjust the example so that it reproduces the problem?

Generally, this looks like an issue on the nested MLflow side. What is the log output from Tune?
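
For example, something along these lines should surface more detail in the driver output (just a generic logging snippet, assuming Tune’s default logger names):

import logging

# Raise the log level for Ray Tune's loggers so scheduling and search decisions
# show up in the driver output.
logging.getLogger("ray.tune").setLevel(logging.DEBUG)

The per-trial output under ~/ray_results/<experiment name>/ would also be useful.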

@kai unfortunately, I can’t seem to create a simple minimal example (as opposed to my admittedly complicated objective) where I see this behavior, so I will close this for now.

Thank you for the response.