Tuner.fit() never terminates

Hi all. I have quite a perplexing problem: when num_samples=1 in the Ray TuneConfig, the HPO runs as expected and terminates after 1 trial. But when num_samples=x with x>1, the HPO runs indefinitely: it runs as expected for the first x trials, and then keeps launching additional runs with the first trial's set of params. Also, this only happens when I try to set the resources (CPUs/GPUs). Any ideas?

  • I’m not running on a cluster
  • I have Ray 2.0.0 installed
  • it seems to occur only when trying to use a GPU

Example: the runs in the red box are the ones that are supposed to run. All the other trials are not supposed to run, yet they get run with the first trial’s param values:

In case it’s useful…

  • I’m on Amazon Linux
  • my training function uses the “function API” (rough sketch just after this list):
    • starts 3 nested mlflow runs to train 3 models on different validation splits
    • averages the metric across folds and calls session.report
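
Here is that rough sketch, heavily simplified — train_one_fold is a placeholder for my actual (more complicated) training code, and "val_loss" stands in for the real metric name:

import mlflow
import numpy as np
from ray.air import session


def objective(config):
    # Point mlflow at the tracking server / experiment passed in via the param space
    mlflow.set_tracking_uri(config["mlflow"]["tracking_uri"])
    mlflow.set_experiment(config["mlflow"]["experiment_name"])

    fold_metrics = []
    with mlflow.start_run():  # parent run for this trial
        for fold in range(3):
            with mlflow.start_run(nested=True):  # one nested run per validation split
                fold_metrics.append(train_one_fold(config, fold))  # placeholder for real training

    # Report the cross-fold average back to Tune
    session.report({"val_loss": float(np.mean(fold_metrics))})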

The Tuner looks like this:

PARAM_SPACE = {
    "dropout": tune.uniform(0.1, 0.5),
    <etc.>
    "mlflow": {
        "experiment_name": experiment_name,
        "tracking_uri": MLFLOW_TRACKING_URI,
    },
}


TUNE_CONFIG = tune.TuneConfig(
    metric=metric,
    mode="min",
    num_samples=n_trials,
    search_alg=OptunaSearch(),
    scheduler=MedianStoppingRule(),
)

tuner = tune.Tuner(
    tune.with_resources(objective, resources={"cpu": n_cpus_trial, "gpu": n_gpus_trial}),
    tune_config=TUNE_CONFIG,
    run_config=air.RunConfig(),
    param_space=PARAM_SPACE,
)

tuner.fit()

I can’t reproduce this with the following script:

import ray
from ray import tune, air
from ray.tune.suggest.optuna import OptunaSearch
from ray.tune.schedulers.median_stopping_rule import MedianStoppingRule

PARAM_SPACE = {
    "dropout": tune.uniform(0.1, 0.5),
}


TUNE_CONFIG = tune.TuneConfig(
    metric="_metric",
    mode="min",
    num_samples=2,
    search_alg=OptunaSearch(),
    scheduler=MedianStoppingRule(),
)


def objective(config):
    # Returning a bare value reports it as the trial result
    # (the "_metric" key that TUNE_CONFIG points at).
    return 4


ray.init(num_cpus=4, num_gpus=1)

tuner = tune.Tuner(
    tune.with_resources(objective, resources={"cpu": 1, "gpu": 1}),
    tune_config=TUNE_CONFIG,
    run_config=air.RunConfig(),
    param_space=PARAM_SPACE,
)

tuner.fit()

Can you try this out and let me know if the problem still comes up? If not, can you adjust the example so that it does?

Generally this looks like an issue on the nested mlflow side. What is the log output from Tune?

@kai unfortunately, I can’t seem to create a simple minimal example (as opposed to my admittedly complicated objective) where I see this behavior, so I will close this for now.

Thank you for the response.

This is two years down the line, but since this is the only thread I could find on the issue, and it doesn’t have an actual solution, I thought I would reply so people can find it easily.

I ran into the same issue trying to run my script in a container with a docker exec ... command: the tuner.fit() call would hang instead of exiting.

I discovered that running the command with a pseudo-TTY allocated (i.e. docker exec -t ...) makes the issue go away. I’m not sure what the underlying cause is, but it fixed the issue for me.
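
If it helps anyone debug this, a quick sanity check (this is only my guess at the cause, not a confirmed diagnosis) is to log at the top of the script whether the process actually sees a TTY:

import sys

# With a plain `docker exec` these are typically False; with `docker exec -t`
# a pseudo-terminal is allocated, so stdout should report True.
print("stdin isatty:", sys.stdin.isatty())
print("stdout isatty:", sys.stdout.isatty())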

One thing I’m still trying to figure out, though, is that it doesn’t work when I call it from my Airflow container. Is that because there is no TTY session when it invokes the script?