Tuner.fit() never terminates

Hi all. I have a perplexing problem: with num_samples=1 in the Ray TuneConfig, the HPO runs as expected and terminates after 1 trial. But with num_samples=x, where x > 1, the HPO runs indefinitely: it behaves as expected for the first x trials, then keeps launching additional trials that reuse the first trial’s param values. And this only happens when I try to set per-trial resources (CPUs/GPUs). Any ideas?

  • I’m not running on a cluster
  • I have ray 2.0.0 installed
  • seems to only occur when trying to use a GPU

Example: the runs in the red box are the ones that are supposed to run; all other trials should not run, yet they get run with the first trial’s param values:

In case it’s useful…

  • I’m on Amazon Linux
  • my training function uses the “function API” (roughly sketched below):
    • starts 3 nested MLflow runs to train 3 models on different validation splits
    • averages the metric across folds and calls session.report
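
The real objective is too tangled to paste in full, but structurally it looks roughly like this (train_one_fold and the "val_loss" metric name are placeholders, not my actual code):

import mlflow
import numpy as np
from ray.air import session


def objective(config):
    mlflow.set_tracking_uri(config["mlflow"]["tracking_uri"])
    mlflow.set_experiment(config["mlflow"]["experiment_name"])

    fold_metrics = []
    with mlflow.start_run():  # parent MLflow run for this trial
        for fold in range(3):
            with mlflow.start_run(nested=True):  # one nested run per validation split
                metric = train_one_fold(config, fold)  # placeholder training routine
                mlflow.log_metric("val_loss", metric)
                fold_metrics.append(metric)

    # Report the cross-fold average back to Tune, once per trial.
    session.report({"val_loss": float(np.mean(fold_metrics))})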

My Tuner setup looks like this:

PARAM_SPACE = {
    "dropout": tune.uniform(0.1, 0.5),
    # <etc.> (remaining hyperparameters omitted)
    "mlflow": {
        # passed through to the objective so it can log to the right MLflow experiment
        "experiment_name": experiment_name,
        "tracking_uri": MLFLOW_TRACKING_URI,
    },
}


TUNE_CONFIG = tune.TuneConfig(
    metric=metric,
    mode="min",
    num_samples=n_trials,
    search_alg=OptunaSearch(),
    scheduler=MedianStoppingRule(),
)

tuner = tune.Tuner(
    tune.with_resources(objective, resources={"cpu": n_cpus_trial, "gpu": n_gpus_trial}),
    tune_config=TUNE_CONFIG,
    run_config=air.RunConfig(),
    param_space=PARAM_SPACE,
)

tuner.fit()
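
In case it’s relevant, here is a quick way to double-check what resources Ray detects on this machine (just a generic snippet for diagnosis, not part of my actual script):

import ray

ray.init(ignore_reinit_error=True)
print(ray.cluster_resources())    # total CPUs/GPUs Ray detected on this machine
print(ray.available_resources())  # what is currently unclaimed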

Can’t reproduce with this script:

import ray
from ray import tune, air
from ray.tune.search.optuna import OptunaSearch
from ray.tune.schedulers import MedianStoppingRule

PARAM_SPACE = {
    "dropout": tune.uniform(0.1, 0.5),
}


TUNE_CONFIG = tune.TuneConfig(
    metric="_metric",
    mode="min",
    num_samples=2,
    search_alg=OptunaSearch(),
    scheduler=MedianStoppingRule(),
)


def objective(config):
    # Returning a bare value from a function trainable reports it as "_metric".
    return 4


ray.init(num_cpus=4, num_gpus=1)  # local "cluster" with 4 CPUs and 1 GPU

tuner = tune.Tuner(
    tune.with_resources(objective, resources={"cpu": 1, "gpu": 1}),
    tune_config=TUNE_CONFIG,
    run_config=air.RunConfig(),
    param_space=PARAM_SPACE,
)

tuner.fit()

Can you try this out and let me know if the problem still comes up? If not, can you adjust the example so that it reproduces the problem?

Generally, this looks like an issue on the nested MLflow side. What is the log output from Tune?
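
For example, something along these lines should surface more detail in the driver output (just a generic logging snippet, assuming Tune’s default logger names):

import logging

# Raise the log level for Ray Tune's loggers so scheduling and search decisions
# show up in the driver output.
logging.getLogger("ray.tune").setLevel(logging.DEBUG)

The per-trial output under ~/ray_results/<experiment name>/ would also be useful.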

@kai unfortunately, I can’t seem to create a simple minimal example (as opposed to my admittedly complicated objective) where I see this behavior, so I will close this for now.

Thank you for the response.