Understanding stop training_iteration parameter

Hi,

I am trying to understand training_iteration in the context of stopping a trial run. If I don't specify a stopping condition, the trials run for 100 iterations. If I specify a value lower than 100, say:

analysis = tune.run(
        trainable_cls,
        scheduler=scheduler,
        num_samples=100,
        search_alg=algo,
        config=config,
        stop={"training_iteration": 32},
        metric="loss",
        mode="min")

The value is respected and the trial ends at 32 iterations. However, if I change this condition to a value above 100, the trials still stop at 100. How can I raise the number of iterations?

Thanks,
Vladimir

Hey @vblagoje, can you confirm from the output (example below) that what you’re seeing is that each of the 100 (num_samples) trials is stopping at iter 100 when training_iteration is set higher?

+-----------------------+------------+-------+--------+------------------+
| Trial name            | status     | loc   |   iter |   total time (s) |
|-----------------------+------------+-------+--------+------------------|
| Trainable_5678d_00000 | TERMINATED |       |    200 |      0.000630856 |
+-----------------------+------------+-------+--------+------------------+

Hey @matthewdeng ,

Not every trial stops at 100, since I am running Bayesian search with the AsyncHyperBandScheduler. A few of them stopped at 100, many at 1, and a few around 10-15 iterations. Here is my full config:

    algo = BayesOptSearch(metric="loss", mode="min")
    algo = ConcurrencyLimiter(algo, max_concurrent=1)
    scheduler = AsyncHyperBandScheduler()
    trainable_cls = DistributedTrainableCreator(
        distributed_bert_pretraining,
        num_workers=config["n_gpu"],
        backend="nccl",
        num_gpus_per_worker=1)

    tune_config = {
        "phase1_learning_rate": tune.uniform(1e-5, 1e-3),
        "weight_decay": tune.uniform(1e-4, 1e-1),
        "warmup_proportion": tune.uniform(0.01, 0.20)
    }
    config = {**config, **tune_config}
    print(f"Config for the tune is {config}")
    analysis = tune.run(
        trainable_cls,
        scheduler=scheduler,
        num_samples=50,
        search_alg=algo,
        config=config,
        stop={"training_iteration": 320},
        metric="loss",
        mode="min")

Thanks in advance,
Vladimir

Thanks for sharing the code! This is because of the default values of AsyncHyperBandScheduler:

                 time_attr: str = "training_iteration",
                 max_t: int = 100,

You can increase the number of iterations by passing in a larger max_t when initializing the AsyncHyperBandScheduler.
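For example, here is a minimal sketch built on your config above (trainable_cls, algo, and config are the ones you already defined; max_t is simply set to match the 320-iteration stop condition you are using):

    from ray import tune
    from ray.tune.schedulers import AsyncHyperBandScheduler

    # Allow ASHA to run trials up to 320 iterations instead of the default max_t=100.
    # Keep max_t in sync with the stop condition; otherwise the smaller of the two
    # still ends the trial first.
    scheduler = AsyncHyperBandScheduler(
        time_attr="training_iteration",
        max_t=320)

    analysis = tune.run(
        trainable_cls,
        scheduler=scheduler,
        num_samples=50,
        search_alg=algo,
        config=config,
        stop={"training_iteration": 320},
        metric="loss",
        mode="min")

With max_t raised, ASHA can still stop poorly performing trials early at the usual rungs, but the promising ones are allowed to run for the full 320 iterations.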

Hey @matthewdeng,
You are right! Thanks again and all the best.