Hi, I was tuning RoBERTa-large using PopulationBasedTraining. All went well until the last iteration. I used tune.randint(1, 33) to tune the batch size, but unfortunately in the last iteration one trial was given a batch size of zero.
Code snippet:
from ray import tune
from ray.tune.schedulers import PopulationBasedTraining

# perturbation_interval is defined elsewhere in the script
scheduler = PopulationBasedTraining(
    time_attr="training_iteration",
    metric="eval_perplexity",
    mode="min",
    perturbation_interval=perturbation_interval,
    require_attrs=True,
    hyperparam_mutations={
        "learning_rate": tune.loguniform(1e-6, 1e-1),
        "per_device_train_batch_size": tune.randint(1, 33),
        "warmup_steps": tune.randint(1, 10001),
        "weight_decay": tune.uniform(1e-1, 0.6),
        "seed": tune.randint(1, 1000),
    },
)
Output:
+------------------------+------------+-------+-----------------+-------------------------------+--------+----------------+----------------+-------------+
| Trial name | status | loc | learning_rate | per_device_train_batch_size | seed | warmup_steps | weight_decay | objective |
|------------------------+------------+-------+-----------------+-------------------------------+--------+----------------+----------------+-------------|
| _objective_39f1e_00000 | TERMINATED | | 5.89949e-06 | 7 | 260 | 3326 | 0.231658 | 8.25573 |
| _objective_39f1e_00001 | TERMINATED | | 5.65482e-05 | 6 | 669 | 5897 | 0.329833 | 7.01775 |
| _objective_39f1e_00002 | TERMINATED | | 1.46929e-06 | 3 | 385 | 8652 | 0.168605 | 8.32849 |
| _objective_39f1e_00003 | TERMINATED | | 6.78579e-05 | 21 | 535 | 7076 | 0.263866 | 7.14714 |
| _objective_39f1e_00004 | TERMINATED | | 8.89983e-06 | 2 | 353 | 5791 | 0.553634 | 7.87616 |
| _objective_39f1e_00005 | TERMINATED | | 1.06798e-05 | 1 | 423 | 1137 | 0.442907 | 6.53822 |
| _objective_39f1e_00006 | TERMINATED | | 0.0029352 | 5 | 125 | 8119 | 0.554919 | 7.15616 |
| _objective_39f1e_00007 | TERMINATED | | 0.0297656 | 13 | 522 | 7076 | 0.42287 | 7.46451 |
| _objective_39f1e_00008 | TERMINATED | | 5.26616e-06 | 25 | 578 | 3845 | 0.182184 | 7.47882 |
| _objective_39f1e_00009 | TERMINATED | | 4.71235e-05 | 6 | 837 | 7372 | 0.412291 | 7.03242 |
| _objective_39f1e_00010 | TERMINATED | | 7.31494e-06 | 22 | 481 | 3692 | 0.372732 | 8.15146 |
| _objective_39f1e_00011 | TERMINATED | | 0.000180527 | 7 | 157 | 6766 | 0.597569 | 6.87004 |
| _objective_39f1e_00012 | TERMINATED | | 3.76988e-05 | 7 | 216 | 8846 | 0.494749 | 7.15246 |
| _objective_39f1e_00013 | TERMINATED | | 1.49914e-06 | 6 | 161 | 8372 | 0.46626 | 8.85303 |
| _objective_39f1e_00015 | TERMINATED | | 6.78579e-05 | 4 | 610 | 7076 | 0.390384 | 8.67094 |
| _objective_39f1e_00014 | ERROR | | 2.0215e-06 | 0 | 338 | 909 | 0.354326 | 7.68783 |
+------------------------+------------+-------+-----------------+-------------------------------+--------+----------------+----------------+-------------+
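A note on the numbers in the table above: the failed trial's seed, warmup_steps and weight_decay all equal trial _objective_39f1e_00005's values multiplied by 0.8, with the integers truncated. Under that reading, a batch size of 1 would become 0. The sketch below only checks this arithmetic against the table. The helper name `scale_and_truncate` is hypothetical, and the idea that the scheduler perturbs values this way is an assumption, not something confirmed here.

```python
def scale_and_truncate(value, factor=0.8):
    """Scale a hyperparameter by `factor`; truncate if the original was an int.

    Hypothetical helper: it mimics an assumed perturbation step and is not
    taken from Ray Tune's source.
    """
    scaled = value * factor
    return int(scaled) if isinstance(value, int) else scaled


# Values of trial _objective_39f1e_00005, copied from the table above.
parent = {
    "per_device_train_batch_size": 1,
    "seed": 423,
    "warmup_steps": 1137,
    "weight_decay": 0.442907,
}

child = {name: scale_and_truncate(value) for name, value in parent.items()}
print(child)
```

The resulting seed (338), warmup_steps (909) and weight_decay (about 0.354326) match the row for _objective_39f1e_00014, whose batch size is 0. The learning rate does not follow the same pattern.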
Error log:
Failure # 1 (occurred at 2021-07-08_12-25-54)
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 718, in _process_trial
results = self.trial_executor.fetch_result(trial)
File "/home/ubuntu/.local/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py", line 688, in fetch_result
result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
File "/home/ubuntu/.local/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 62, in wrapper
return func(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.8/site-packages/ray/worker.py", line 1495, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray::ImplicitFunc.train_buffered() (pid=65711, ip=104.171.200.139)
File "python/ray/_raylet.pyx", line 501, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 451, in ray._raylet.execute_task.function_executor
File "/home/ubuntu/.local/lib/python3.8/site-packages/ray/_private/function_manager.py", line 563, in actor_method_executor
return method(__ray_actor, *args, **kwargs)
File "/home/ubuntu/.local/lib/python3.8/site-packages/ray/tune/trainable.py", line 173, in train_buffered
result = self.train()
File "/home/ubuntu/.local/lib/python3.8/site-packages/ray/tune/trainable.py", line 232, in train
result = self.step()
File "/home/ubuntu/.local/lib/python3.8/site-packages/ray/tune/function_runner.py", line 366, in step
self._report_thread_runner_error(block=True)
File "/home/ubuntu/.local/lib/python3.8/site-packages/ray/tune/function_runner.py", line 512, in _report_thread_runner_error
raise TuneError(
ray.tune.error.TuneError: Trial raised an exception. Traceback:
ray::ImplicitFunc.train_buffered() (pid=65711, ip=104.171.200.139)
File "/home/ubuntu/.local/lib/python3.8/site-packages/ray/tune/function_runner.py", line 248, in run
self._entrypoint()
File "/home/ubuntu/.local/lib/python3.8/site-packages/ray/tune/function_runner.py", line 315, in entrypoint
return self._trainable_func(self.config, self._status_reporter,
File "/home/ubuntu/.local/lib/python3.8/site-packages/ray/tune/function_runner.py", line 580, in _trainable_func
output = fn()
File "/home/ubuntu/.local/lib/python3.8/site-packages/ray/tune/utils/trainable.py", line 331, in inner
trainable(config, **fn_kwargs)
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/integrations.py", line 162, in _objective
local_trainer.train(resume_from_checkpoint=checkpoint, trial=trial)
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1084, in train
train_dataloader = self.get_train_dataloader()
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/trainer.py", line 633, in get_train_dataloader
return DataLoader(
File "/usr/lib/python3/dist-packages/torch/utils/data/dataloader.py", line 272, in __init__
batch_sampler = BatchSampler(sampler, batch_size, drop_last)
File "/usr/lib/python3/dist-packages/torch/utils/data/sampler.py", line 216, in __init__
raise ValueError("batch_size should be a positive integer value, "
ValueError: batch_size should be a positive integer value, but got batch_size=0
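A possible guard against this ValueError, as a minimal sketch: clamp the batch size to at least 1 before the trainer builds its DataLoader. The helper name `clamp_batch_size` is hypothetical and is not part of transformers or Ray Tune, and where exactly it would be called in the objective is left open. It treats the symptom only and does not explain why a zero was produced.

```python
def clamp_batch_size(config, key="per_device_train_batch_size", minimum=1):
    """Return a copy of `config` whose batch size is at least `minimum`.

    Hypothetical helper, not part of transformers or Ray Tune.
    """
    fixed = dict(config)  # leave the caller's dict untouched
    fixed[key] = max(minimum, int(fixed[key]))
    return fixed


# Hyperparameters of the failed trial, copied from the table above.
failed = {
    "learning_rate": 2.0215e-06,
    "per_device_train_batch_size": 0,
    "seed": 338,
    "warmup_steps": 909,
    "weight_decay": 0.354326,
}

safe = clamp_batch_size(failed)
print(safe["per_device_train_batch_size"])
```

Values that are already valid pass through unchanged, so the guard would only affect trials like the failed one.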