tune.randint returned integer outside specified range

Hi, I was tuning RoBERTa-large using PopulationBasedTraining. All went well until the last iteration. I used tune.randint(1, 33) to tune the batch size, but in the last iteration one trial ended up with a batch size of zero.

code snippet

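from ray import tune
from ray.tune.schedulers import PopulationBasedTraining

# perturbation_interval is defined elsewhere in the script (not shown here)
scheduler = PopulationBasedTraining(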
    time_attr="training_iteration",
    metric="eval_perplexity",
    mode="min",
    perturbation_interval=perturbation_interval,
    require_attrs=True,
    hyperparam_mutations={
        "learning_rate": tune.loguniform(1e-6, 1e-1),
        "per_device_train_batch_size": tune.randint(1, 33),
        "warmup_steps": tune.randint(1, 10001),
        "weight_decay": tune.uniform(1e-1, 0.6),
        "seed": tune.randint(1, 1000),
    })

output

+------------------------+------------+-------+-----------------+-------------------------------+--------+----------------+----------------+-------------+
| Trial name             | status     | loc   |   learning_rate |   per_device_train_batch_size |   seed |   warmup_steps |   weight_decay |   objective |
|------------------------+------------+-------+-----------------+-------------------------------+--------+----------------+----------------+-------------|
| _objective_39f1e_00000 | TERMINATED |       |     5.89949e-06 |                             7 |    260 |           3326 |       0.231658 |     8.25573 |
| _objective_39f1e_00001 | TERMINATED |       |     5.65482e-05 |                             6 |    669 |           5897 |       0.329833 |     7.01775 |
| _objective_39f1e_00002 | TERMINATED |       |     1.46929e-06 |                             3 |    385 |           8652 |       0.168605 |     8.32849 |
| _objective_39f1e_00003 | TERMINATED |       |     6.78579e-05 |                            21 |    535 |           7076 |       0.263866 |     7.14714 |
| _objective_39f1e_00004 | TERMINATED |       |     8.89983e-06 |                             2 |    353 |           5791 |       0.553634 |     7.87616 |
| _objective_39f1e_00005 | TERMINATED |       |     1.06798e-05 |                             1 |    423 |           1137 |       0.442907 |     6.53822 |
| _objective_39f1e_00006 | TERMINATED |       |     0.0029352   |                             5 |    125 |           8119 |       0.554919 |     7.15616 |
| _objective_39f1e_00007 | TERMINATED |       |     0.0297656   |                            13 |    522 |           7076 |       0.42287  |     7.46451 |
| _objective_39f1e_00008 | TERMINATED |       |     5.26616e-06 |                            25 |    578 |           3845 |       0.182184 |     7.47882 |
| _objective_39f1e_00009 | TERMINATED |       |     4.71235e-05 |                             6 |    837 |           7372 |       0.412291 |     7.03242 |
| _objective_39f1e_00010 | TERMINATED |       |     7.31494e-06 |                            22 |    481 |           3692 |       0.372732 |     8.15146 |
| _objective_39f1e_00011 | TERMINATED |       |     0.000180527 |                             7 |    157 |           6766 |       0.597569 |     6.87004 |
| _objective_39f1e_00012 | TERMINATED |       |     3.76988e-05 |                             7 |    216 |           8846 |       0.494749 |     7.15246 |
| _objective_39f1e_00013 | TERMINATED |       |     1.49914e-06 |                             6 |    161 |           8372 |       0.46626  |     8.85303 |
| _objective_39f1e_00015 | TERMINATED |       |     6.78579e-05 |                             4 |    610 |           7076 |       0.390384 |     8.67094 |
| _objective_39f1e_00014 | ERROR      |       |     2.0215e-06  |                             0 |    338 |            909 |       0.354326 |     7.68783 |
+------------------------+------------+-------+-----------------+-------------------------------+--------+----------------+----------------+-------------+

error log

Failure # 1 (occurred at 2021-07-08_12-25-54)
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 718, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py", line 688, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 62, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/ray/worker.py", line 1495, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray::ImplicitFunc.train_buffered() (pid=65711, ip=104.171.200.139)
  File "python/ray/_raylet.pyx", line 501, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 451, in ray._raylet.execute_task.function_executor
  File "/home/ubuntu/.local/lib/python3.8/site-packages/ray/_private/function_manager.py", line 563, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/ray/tune/trainable.py", line 173, in train_buffered
    result = self.train()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/ray/tune/trainable.py", line 232, in train
    result = self.step()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/ray/tune/function_runner.py", line 366, in step
    self._report_thread_runner_error(block=True)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/ray/tune/function_runner.py", line 512, in _report_thread_runner_error
    raise TuneError(
ray.tune.error.TuneError: Trial raised an exception. Traceback:
ray::ImplicitFunc.train_buffered() (pid=65711, ip=104.171.200.139)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/ray/tune/function_runner.py", line 248, in run
    self._entrypoint()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/ray/tune/function_runner.py", line 315, in entrypoint
    return self._trainable_func(self.config, self._status_reporter,
  File "/home/ubuntu/.local/lib/python3.8/site-packages/ray/tune/function_runner.py", line 580, in _trainable_func
    output = fn()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/ray/tune/utils/trainable.py", line 331, in inner
    trainable(config, **fn_kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/integrations.py", line 162, in _objective
    local_trainer.train(resume_from_checkpoint=checkpoint, trial=trial)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1084, in train
    train_dataloader = self.get_train_dataloader()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/trainer.py", line 633, in get_train_dataloader
    return DataLoader(
  File "/usr/lib/python3/dist-packages/torch/utils/data/dataloader.py", line 272, in __init__
    batch_sampler = BatchSampler(sampler, batch_size, drop_last)
  File "/usr/lib/python3/dist-packages/torch/utils/data/sampler.py", line 216, in __init__
    raise ValueError("batch_size should be a positive integer value, "
ValueError: batch_size should be a positive integer value, but got batch_size=0

Hi @vinay_ethiraj,

In PBT, numeric hyperparameters are perturbed during mutation by multiplying them by a constant factor (0.8 or 1.2) for exploration. This is the default behavior, as described in the original paper. So what likely happened is that one of your trials exploited another trial with a batch size of 1 and then multiplied it by 0.8, which, once truncated to an integer, gives a batch size of 0.
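
To make the arithmetic concrete, here is a minimal sketch in plain Python. It is a simplified model of the perturbation step, not Ray's actual code: the helper name perturb_int is hypothetical, and it assumes the scaled value is truncated with int().

```python
def perturb_int(value: int, factor: float) -> int:
    """Scale an integer hyperparameter and truncate the result back to int.

    Simplified model of a multiplicative perturbation; assumes truncation
    via int(), which always rounds toward zero.
    """
    return int(value * factor)


# A batch size of 1 scaled down by 0.8 gives 0.8, which truncates to 0.
too_small = perturb_int(1, 0.8)

# A batch size of 32 scaled up by 1.2 gives 38.4, which truncates to 38,
# above the largest value tune.randint(1, 33) would ever sample.
too_large = perturb_int(32, 1.2)

print(too_small, too_large)
```

Under this model the perturbed value can leave the original range in both directions, so the range passed to tune.randint only constrains the sampled values, not the perturbed ones.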

There are a couple of things you can do here:

  1. You can pass a custom_explore_fn to PBT that sets the batch size to max(1, config["per_device_train_batch_size"]).
  2. You can remove per_device_train_batch_size from the hyperparam_mutations dict (but still pass it in the config parameter of tune.run) - the batch size will then no longer be tuned by population based training.
  3. You can specify "per_device_train_batch_size": tune.choice([1, 2, 4, 8, 16, 32]) - this makes the valid batch sizes explicit (which may or may not make sense for you). And since the values are discrete, PBT will never multiply the value by 0.8 or 1.2 but will choose a different value from the list instead.
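
A rough sketch of option 1, assuming custom_explore_fn receives the perturbed config dict and returns the config to use. The function name clamp_batch_size and the upper bound of 32 are my own additions for illustration; the suggestion above only calls for the max(1, ...) part.

```python
import copy

BATCH_SIZE_KEY = "per_device_train_batch_size"
MIN_BATCH_SIZE = 1
MAX_BATCH_SIZE = 32  # assumption: largest value tune.randint(1, 33) can sample


def clamp_batch_size(config: dict) -> dict:
    """Return a copy of config with the batch size forced into [1, 32]."""
    new_config = copy.deepcopy(config)
    batch_size = int(new_config[BATCH_SIZE_KEY])
    # max(...) repairs a perturbed value of 0; min(...) caps upward drift.
    new_config[BATCH_SIZE_KEY] = min(MAX_BATCH_SIZE, max(MIN_BATCH_SIZE, batch_size))
    return new_config


# Example: a config whose batch size was perturbed down to 0.
fixed = clamp_batch_size({BATCH_SIZE_KEY: 0, "seed": 338})
print(fixed[BATCH_SIZE_KEY])
```

You would then add custom_explore_fn=clamp_batch_size to the PopulationBasedTraining(...) call from the question.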

Thank you for the detailed explanation. I think option 3 will be a better match for my problem.