Hi,
I am trying to understand training_iteration in the context of stopping a trial run. If I don't specify a stopping condition, the trials run for 100 iterations. If I specify a value lower than 100, say:
analysis = tune.run(
    trainable_cls,
    scheduler=scheduler,
    num_samples=100,
    search_alg=algo,
    config=config,
    stop={"training_iteration": 32},
    metric="loss",
    mode="min",
)
the value is respected and the trial ends at 32 iterations. However, if I raise this condition above 100, the trials still stop at 100. How can I increase the number of iterations?
Thanks,
Vladimir
Hey @vblagoje, can you confirm from the output (example below) that what you're seeing is that each of the 100 (num_samples) trials is stopping at iter 100 even when training_iteration is set higher?
+-----------------------+------------+-------+--------+------------------+
| Trial name | status | loc | iter | total time (s) |
|-----------------------+------------+-------+--------+------------------|
| Trainable_5678d_00000 | TERMINATED | | 200 | 0.000630856 |
+-----------------------+------------+-------+--------+------------------+
Hey @matthewdeng ,
Not every trial stops at 100, as I am running Bayesian search with AsyncHyperBandScheduler. A few of the trials stopped at 100, many at 1, and a few around 10-15. Here is my full config:
algo = BayesOptSearch(metric="loss", mode="min")
algo = ConcurrencyLimiter(algo, max_concurrent=1)
scheduler = AsyncHyperBandScheduler()

trainable_cls = DistributedTrainableCreator(
    distributed_bert_pretraining,
    num_workers=config["n_gpu"],
    backend="nccl",
    num_gpus_per_worker=1,
)

tune_config = {
    "phase1_learning_rate": tune.uniform(1e-5, 1e-3),
    "weight_decay": tune.uniform(1e-4, 1e-1),
    "warmup_proportion": tune.uniform(0.01, 0.20),
}
config = {**config, **tune_config}
print(f"Config for the tune is {config}")

analysis = tune.run(
    trainable_cls,
    scheduler=scheduler,
    num_samples=50,
    search_alg=algo,
    config=config,
    stop={"training_iteration": 320},
    metric="loss",
    mode="min",
)
Thanks in advance,
Vladimir
Thanks for sharing the code! This is because of the default values of AsyncHyperBandScheduler:
time_attr: str = "training_iteration",
max_t: int = 100,
You can increase the number of iterations by passing in a larger max_t when initializing the AsyncHyperBandScheduler, e.g. AsyncHyperBandScheduler(max_t=320).
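In other words, the scheduler's max_t acts as a second cap alongside the stop condition; a trial terminates at whichever limit it hits first. A minimal sketch of that interaction (a hypothetical helper for illustration, not the Ray Tune API):

```python
# Illustration only: a trial ends at the first limit reached, so the
# effective cap is the smaller of stop={"training_iteration": ...}
# and the scheduler's max_t.
def effective_cap(stop_training_iteration: int, max_t: int) -> int:
    return min(stop_training_iteration, max_t)

# With the default max_t=100, a stop condition of 320 never applies:
print(effective_cap(320, 100))  # 100
# Raising max_t to 320 lets the stop condition take effect:
print(effective_cap(320, 320))  # 320
```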
Hey @matthewdeng,
You are right! Thanks again and all the best.