I want to create a search space for a neural network with n layers (n chosen from [2, 3, 4, 5]), where each layer has a randomly chosen number of units (from [100, 200, 300]).
If I have 2 layers, the units could be [100, 300], [300, 200], [200, 200], etc.
If I have 3 layers, the units could be [100, 200, 100], [300, 100, 200], [200, 300, 300], etc.
The code could look like this:
def train_nn(config):
    ...

space = {"n_layers": tune.choice([2, 3, 4, 5]),
         "n_units": [tune.choice([100, 200, 300]) for i in range(space['n_layers'])]
         }
analysis = tune.run(
    train_nn,
    config=space,
)
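Here I try to build space['n_units'] from the randomly selected value of space['n_layers']. This does not work. It is not a syntax error but a TypeError, because space['n_layers'] is a sampler object rather than an integer: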
TypeError Traceback (most recent call last)
<ipython-input-13-e7959554a989> in <module>()
1 # search space
2 space = {"n_layers": tune.choice([2,3,4,5]),
----> 3 "n_units": [tune.choice([100, 200,300]) for i in range(space['n_layers'])]
TypeError: 'Categorical' object cannot be interpreted as an integer
The implementation in Optuna is as follows; what is the corresponding implementation in Ray Tune?
def objective(trial: optuna.Trial):
    n_layers = trial.suggest_int('n_layers', 1, 5)  # `n_layers` is 1, 2, 3, 4, or 5.
    layers, ps = [], []  # number of units of each layer / dropout ratio of each layer
    for i in range(n_layers - 1):  # `TabularModel` automatically adds the last layer.
        num_units = trial.suggest_categorical(f'num_units_layer_{i}', [800, 900, 1000, 1100, 1200])
        p = trial.suggest_discrete_uniform(f'dropout_p_layer_{i}', 0, 1, 0.05)
        layers.append(num_units); ps.append(p)
    emb_drop = trial.suggest_discrete_uniform('emb_drop', 0, 1, 0.05)
    learn = tabular_learner(data, layers=layers, ps=ps, emb_drop=emb_drop, y_range=y_range, metrics=exp_rmspe)
    learn.fit_one_cycle(5, 1e-3, wd=0.2)
    return learn.validate()[-1].item()  # Of course you can use the last record of `learn.recorder`.

study = optuna.create_study()
study.optimize(objective)
best_trial = study.best_trial
If `tune.choice([2,3,4,5])` were equivalent to `np.random.choice([2,3,4,5])`, the solution would be easy:
# first, we define dictionary space
space = {"n_layers": tune.choice([2,3,4,5])}
# then we add an additional items to the space dictionary
space["n_units"] = [tune.choice([100, 200,300]) for i in range(space['n_layers'])]
But unfortunately, `tune.choice([2,3,4,5])` is not equivalent to `np.random.choice([2,3,4,5])`: the former returns a `ray.tune.sample.Categorical` object while the latter returns a `numpy.int64`. Therefore the above code raises an error:
TypeError: 'Categorical' object cannot be interpreted as an integer
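The underlying issue is one of ordering: the number of layers has to be drawn before the per-layer unit counts can be drawn. The plain-Python sketch below illustrates that two-stage sampling using only the standard library; `sample_config` and its arguments are hypothetical names and not part of Ray Tune. In Ray Tune itself, `tune.sample_from` accepts a callable that is evaluated at sampling time, which is one place where such a dependency could be expressed.

```python
import random


def sample_config(rng, layer_choices=(2, 3, 4, 5), unit_choices=(100, 200, 300)):
    """Draw n_layers first, then one unit count per layer (illustrative helper)."""
    n_layers = rng.choice(layer_choices)
    # The list length depends on the value drawn above, so it must be built afterwards.
    n_units = [rng.choice(unit_choices) for _ in range(n_layers)]
    return {"n_layers": n_layers, "n_units": n_units}


rng = random.Random(0)  # seeded only so that repeated runs draw the same configs
configs = [sample_config(rng) for _ in range(5)]
for config in configs:
    print(config)
```

Each printed config has exactly `n_layers` entries in `n_units`, which is the property the dictionary comprehension in the question was trying to achieve.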
Hey @Paul, another way to do this is to use our Optuna integration, which supports define-by-run. You can reuse your existing Optuna objective function; the only difference is that you need to split it into a define function and a run (trainable) function. It would look like this:
import optuna
from ray import tune
from ray.tune.suggest.optuna import OptunaSearch

def define_by_run_func(trial: optuna.Trial):
    n_layers = trial.suggest_int('n_layers', 1, 5)  # `n_layers` is 1, 2, 3, 4, or 5.
    for i in range(n_layers - 1):  # `TabularModel` automatically adds the last layer.
        num_units = trial.suggest_categorical(f'num_units_layer_{i}', [800, 900, 1000, 1100, 1200])
        p = trial.suggest_discrete_uniform(f'dropout_p_layer_{i}', 0, 1, 0.05)
    emb_drop = trial.suggest_discrete_uniform('emb_drop', 0, 1, 0.05)
    return

def trainable(config, checkpoint_dir=None):
    emb_drop = config.pop("emb_drop")
    num_layers = config.pop("n_layers")
    # Only n_layers - 1 layers are sampled above, so size the lists to match.
    layers, ps = [None] * (num_layers - 1), [None] * (num_layers - 1)
    # Every remaining key is expected to end in a layer index, e.g. "num_units_layer_0".
    for k, v in config.items():
        index = int(k.split("_")[-1])
        if "num_units" in k:
            layers[index] = v
        elif "dropout" in k:
            ps[index] = v
    # `data`, `y_range` and `exp_rmspe` are assumed to be defined as in the Optuna example.
    learn = tabular_learner(data, layers=layers, ps=ps, emb_drop=emb_drop, y_range=y_range, metrics=exp_rmspe)
    learn.fit_one_cycle(5, 1e-3, wd=0.2)
    tune.report(loss=learn.validate()[-1].item())

algo = OptunaSearch(
    space=define_by_run_func, metric="loss", mode="min")
analysis = tune.run(
    trainable,
    metric="loss",
    mode="min",
    search_alg=algo,
    num_samples=10,
)
There should be no speed difference between the two methods. Most search algorithms we have implemented in Tune (other than random search) don’t support conditional search spaces through nested dictionaries, so by using Optuna define-by-run you can take advantage of Optuna’s Bayesian optimisation, which should give better results than a conditional search space with random search.
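With define-by-run, the trainable receives a flat dictionary whose per-layer keys carry the layer index in their name, and it has to turn those back into lists. As a minimal sketch of that step, assuming the key names used in the thread (`num_units_layer_{i}`, `dropout_p_layer_{i}`) and that only `n_layers - 1` layers are sampled, the lists can be rebuilt by looking the keys up directly. `build_layer_lists` is a hypothetical helper and the config below is hand-written, not real Tune output.

```python
def build_layer_lists(config):
    """Rebuild per-layer lists from a flat define-by-run config (illustrative helper)."""
    n_hidden = config["n_layers"] - 1  # the last layer is added by the model itself
    layers = [config[f"num_units_layer_{i}"] for i in range(n_hidden)]
    ps = [config[f"dropout_p_layer_{i}"] for i in range(n_hidden)]
    return layers, ps


# Hand-written example of what a sampled config could look like for n_layers = 3.
example_config = {
    "n_layers": 3,
    "num_units_layer_0": 900,
    "num_units_layer_1": 1000,
    "dropout_p_layer_0": 0.5,
    "dropout_p_layer_1": 0.1,
    "emb_drop": 0.0,
}

layers, ps = build_layer_lists(example_config)
print(layers, ps)
```

Looking keys up by name rather than iterating over every entry means that unrelated hyperparameters in the config are simply ignored.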
Hi @Yard1, great catch! Actually I mis-copied one line of code about data:
# prep data for tabular_learner()
data = TabularDataBunch.from_df(path, df, dep_var=dep_var, valid_idx=valid_idx, procs=procs, cat_names=cat_names)
Here is the complete set of code:
from ray import tune
import optuna
from ray.tune.suggest.optuna import OptunaSearch
from fastai.tabular import *
# define path
path = untar_data(URLs.ADULT_SAMPLE)
# load data
df = pd.read_csv(path/'adult.csv')
# simple split data into train & valid
valid_idx = range(len(df)-2000, len(df))
# define local variables
dep_var = 'salary'
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
#helper functions
def find_appropriate_lr(model, lr_diff=15, loss_threshold=.05, adjust_value=1, plot=False):
    """automatically find the appropriate learning rate
    Args:
        model (learner)
        lr_diff(int, default 15)
        loss_threshold(float, default .05)
        adjust_value(float, default = 1),
        plot (bool default= False)
    Return:
        lr (float): optimal learning rate.
    Ref: https://forums.fast.ai/t/automated-learning-rate-suggester/44199 """
    #Run the Learning Rate Finder
    model.lr_find()
    #Get loss values and their corresponding gradients, and get lr values
    losses = np.array(model.recorder.losses)
    assert(lr_diff < len(losses))
    loss_grad = np.gradient(losses)
    lrs = model.recorder.lrs
    #Search for index in gradients where loss is lowest before the loss spike
    #Initialize right and left idx using the lr_diff as a spacing unit
    #Set the local min lr as -1 to signify if threshold is too low
    r_idx = -1
    l_idx = r_idx - lr_diff
    while (l_idx >= -len(losses)) and (abs(loss_grad[r_idx] - loss_grad[l_idx]) > loss_threshold):
        local_min_lr = lrs[l_idx]
        r_idx -= 1
        l_idx -= 1
    lr_to_use = local_min_lr * adjust_value
    if plot:
        # plots the gradients of the losses in respect to the learning rate change
        plt.plot(loss_grad)
        plt.plot(len(losses)+l_idx, loss_grad[l_idx],markersize=10,marker='o',color='red')
        plt.ylabel("Loss")
        plt.xlabel("Index of LRs")
        plt.show()
        plt.plot(np.log10(lrs), losses)
        plt.ylabel("Loss")
        plt.xlabel("Log 10 Transform of Learning Rate")
        loss_coord = np.interp(np.log10(lr_to_use), np.log10(lrs), losses)
        plt.plot(np.log10(lr_to_use), loss_coord, markersize=10,marker='o',color='red')
        plt.show()
    return lr_to_use

def define_by_run_func(trial: optuna.Trial):
    """Define-by-run function to create the search space.
    Ensure no actual computation takes place here. That should go into
    the trainable passed to ``tune.run`` (in this example, that's
    ``trainable``).
    For more information, see https://optuna.readthedocs.io/en/stable/tutorial/10_key_features/002_configurations.html
    This function should either return None or a dict with constant values.
    """
    n_layers = trial.suggest_int('n_layers', 1, 5)  # `n_layers` is 1, 2, 3, 4, or 5.
    #layers, ps = [], []
    for i in range(n_layers - 1):  # `TabularModel` automatically adds the last layer.
        num_units = trial.suggest_categorical(f'num_units_layer_{i}', [800, 900, 1000, 1100, 1200])
        p = trial.suggest_discrete_uniform(f'dropout_p_layer_{i}', 0, 1, 0.05)
        #layers.append(num_units)
        #ps.append(p)
    emb_drop = trial.suggest_discrete_uniform('emb_drop', 0, 1, 0.05)
    n_epochs = trial.suggest_categorical('n_epochs', [1,2,4,5,7,9,10])
    #para_dic = {'n_layers':n_layers, 'layers':layers, 'ps':ps, emb_drop:'emb_drop'}
    return

def trainable(config, checkpoint_dir = None):
    emb_drop = config.pop("emb_drop")
    num_layers = config.pop("n_layers")
    layers, ps = [None]*num_layers, [None]*num_layers
    for k, v in config.items():
        index = int(k.split("_")[-1])
        if "num_units" in k:
            layers[index] = v
        elif "dropout" in k:
            ps[index] = v
    #metrics
    f1=FBeta()
    precision = Precision()
    recall = Recall()
    # prep data for tabular_learner()
    data = TabularDataBunch.from_df(path, df, dep_var=dep_var, valid_idx=valid_idx, procs=procs, cat_names=cat_names)
    # train classifier
    learn = tabular_learner(data, layers=layers, ps=ps, emb_drop=emb_drop, emb_szs={'native-country': 10}, metrics=[accuracy, precision, recall, f1])
    # auto find learning rate
    try:
        lr = find_appropriate_lr(model=learn, plot=True)
        print(f'clf uses estimated lr={lr}')
    except:
        lr = 1e-2
        print(f'clf uses pre-defined lr={lr}')
    # train n_epoch
    n_epochs = config['n_epochs']
    learn.fit_one_cycle(n_epochs, moms=(lr*0.01,lr))
    # build validation performance metrics
    valid_metrics = dict(zip(['accuracy', 'precision', 'recall', 'f1'], [x.item() for x in learn.recorder.metrics[-1]]))  # -1 means selecting the last epoch
    # send metrics to tune
    tune.report(**valid_metrics)

# hyperparameters tuning by Optuna
algo = OptunaSearch(
    space=define_by_run_func,
    metric="f1",
    mode="max")
analysis = tune.run(
    trainable,  # in case trainable has other arguments: tune.with_parameters(trainable, data=df),
    metric="f1",
    mode="max",
    search_alg=algo,
    num_samples=600,
)
The error message from running the above code was:
---------------------------------------------------------------------------
TuneError Traceback (most recent call last)
<ipython-input-3-33850473a538> in <module>()
----> 1 get_ipython().run_cell_magic('time', '', '# hyperparameters tuning by Optuna\n\nalgo = OptunaSearch(\n space=define_by_run_func, \n metric="f1", \n mode="max")\n\nanalysis = tune.run(\n trainable, #in case trainable has other arguments: tune.with_parameters(trainable, data=df),\n metric="f1",\n mode="max",\n search_alg=algo,\n num_samples=600, \n \n)')
3 frames
/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py in run_cell_magic(self, magic_name, line, cell)
2115 magic_arg_s = self.var_expand(line, stack_depth)
2116 with self.builtin_trap:
-> 2117 result = fn(magic_arg_s, cell)
2118 return result
2119
<decorator-gen-53> in time(self, line, cell, local_ns)
/usr/local/lib/python3.7/dist-packages/IPython/core/magic.py in <lambda>(f, *a, **k)
186 # but it's overkill for just that one bit of state.
187 def magic_deco(arg):
--> 188 call = lambda f, *a, **k: f(*a, **k)
189
190 if callable(arg):
/usr/local/lib/python3.7/dist-packages/IPython/core/magics/execution.py in time(self, line, cell, local_ns)
1191 else:
1192 st = clock2()
-> 1193 exec(code, glob, local_ns)
1194 end = clock2()
1195 out = None
<timed exec> in <module>()
/usr/local/lib/python3.7/dist-packages/ray/tune/tune.py in run(run_or_experiment, name, metric, mode, stop, time_budget_s, config, resources_per_trial, num_samples, local_dir, search_alg, scheduler, keep_checkpoints_num, checkpoint_score_attr, checkpoint_freq, checkpoint_at_end, verbose, progress_reporter, log_to_file, trial_name_creator, trial_dirname_creator, sync_config, export_formats, max_failures, fail_fast, restore, server_port, resume, queue_trials, reuse_actors, trial_executor, raise_on_failed_trial, callbacks, loggers, ray_auto_init, run_errored_only, global_checkpoint_period, with_server, upload_dir, sync_to_cloud, sync_to_driver, sync_on_checkpoint, _remote)
553 if incomplete_trials:
554 if raise_on_failed_trial and not state[signal.SIGINT]:
--> 555 raise TuneError("Trials did not complete", incomplete_trials)
556 else:
557 logger.error("Trials did not complete: %s", incomplete_trials)
TuneError: ('Trials did not complete', [trainable_a7c07abc, trainable_a7e5d122, trainable_a7f956de, trainable_aa3120b2,
...
What puzzles me is that I have encountered TuneError: ('Trials did not complete', [trainable_a7c07abc, trainable_a7e5d122, trainable_a7f956de, ... many times. What causes it and how can I fix it?
There should be a stack trace from inside the trainable that would tell us the exact reason for the trials not completing. That stack trace would be printed out from the cell that was running tune.run. Is it possible for you to share the output from that cell?
The execution of that cell produced tens of thousands of lines of output, which look like this (here is a small subset):
(pid=298) 2021-10-05 19:15:55,548 ERROR function_runner.py:266 -- Runner Thread raised error.
(pid=298) Traceback (most recent call last):
(pid=298) File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 260, in run
(pid=298) self._entrypoint()
(pid=298) File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 329, in entrypoint
(pid=298) self._status_reporter.get_checkpoint())
(pid=298) File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 594, in _trainable_func
(pid=298) output = fn()
(pid=298) File "<ipython-input-2-edfb4cc8f06c>", line 82, in trainable
(pid=298) ValueError: invalid literal for int() with base 10: 'epochs'
(pid=298) Exception in thread Thread-2:
(pid=298) Traceback (most recent call last):
(pid=298) File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
(pid=298) self.run()
(pid=298) File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 279, in run
(pid=298) raise e
(pid=298) File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 260, in run
(pid=298) self._entrypoint()
(pid=298) File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 329, in entrypoint
(pid=298) self._status_reporter.get_checkpoint())
(pid=298) File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 594, in _trainable_func
(pid=298) output = fn()
(pid=298) File "<ipython-input-2-edfb4cc8f06c>", line 82, in trainable
(pid=298) ValueError: invalid literal for int() with base 10: 'epochs'
(pid=298)
2021-10-05 19:15:55,752 ERROR trial_runner.py:773 -- Trial trainable_a7c07abc: Error processing event.
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/ray/tune/trial_runner.py", line 739, in _process_trial
results = self.trial_executor.fetch_result(trial)
File "/usr/local/lib/python3.7/dist-packages/ray/tune/ray_trial_executor.py", line 746, in fetch_result
result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
File "/usr/local/lib/python3.7/dist-packages/ray/_private/client_mode_hook.py", line 82, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/ray/worker.py", line 1621, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray::ImplicitFunc.train_buffered() (pid=298, ip=172.28.0.2, repr=<types.ImplicitFunc object at 0x7f9e7c03bf50>)
File "/usr/local/lib/python3.7/dist-packages/ray/tune/trainable.py", line 178, in train_buffered
result = self.train()
File "/usr/local/lib/python3.7/dist-packages/ray/tune/trainable.py", line 237, in train
result = self.step()
File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 379, in step
self._report_thread_runner_error(block=True)
File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 527, in _report_thread_runner_error
("Trial raised an exception. Traceback:\n{}".format(err_tb_str)
ray.tune.error.TuneError: Trial raised an exception. Traceback:
ray::ImplicitFunc.train_buffered() (pid=298, ip=172.28.0.2, repr=<types.ImplicitFunc object at 0x7f9e7c03bf50>)
File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 260, in run
self._entrypoint()
File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 329, in entrypoint
self._status_reporter.get_checkpoint())
File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 594, in _trainable_func
output = fn()
File "<ipython-input-2-edfb4cc8f06c>", line 82, in trainable
ValueError: invalid literal for int() with base 10: 'epochs'
(pid=299) 2021-10-05 19:15:55,746 ERROR function_runner.py:266 -- Runner Thread raised error.
(pid=299) Traceback (most recent call last):
(pid=299) File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 260, in run
(pid=299) self._entrypoint()
(pid=299) File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 329, in entrypoint
(pid=299) self._status_reporter.get_checkpoint())
(pid=299) File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 594, in _trainable_func
(pid=299) output = fn()
(pid=299) File "<ipython-input-2-edfb4cc8f06c>", line 82, in trainable
(pid=299) ValueError: invalid literal for int() with base 10: 'epochs'
(pid=299) Exception in thread Thread-2:
(pid=299) Traceback (most recent call last):
(pid=299) File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
(pid=299) self.run()
(pid=299) File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 279, in run
(pid=299) raise e
(pid=299) File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 260, in run
(pid=299) self._entrypoint()
(pid=299) File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 329, in entrypoint
(pid=299) self._status_reporter.get_checkpoint())
(pid=299) File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 594, in _trainable_func
(pid=299) output = fn()
(pid=299) File "<ipython-input-2-edfb4cc8f06c>", line 82, in trainable
(pid=299) ValueError: invalid literal for int() with base 10: 'epochs'
(pid=299)
2021-10-05 19:15:55,949 ERROR trial_runner.py:773 -- Trial trainable_a7e5d122: Error processing event.
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/ray/tune/trial_runner.py", line 739, in _process_trial
results = self.trial_executor.fetch_result(trial)
File "/usr/local/lib/python3.7/dist-packages/ray/tune/ray_trial_executor.py", line 746, in fetch_result
result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
File "/usr/local/lib/python3.7/dist-packages/ray/_private/client_mode_hook.py", line 82, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/ray/worker.py", line 1621, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray::ImplicitFunc.train_buffered() (pid=299, ip=172.28.0.2, repr=<types.ImplicitFunc object at 0x7f3b321f4610>)
File "/usr/local/lib/python3.7/dist-packages/ray/tune/trainable.py", line 178, in train_buffered
result = self.train()
File "/usr/local/lib/python3.7/dist-packages/ray/tune/trainable.py", line 237, in train
result = self.step()
File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 379, in step
self._report_thread_runner_error(block=True)
File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 527, in _report_thread_runner_error
("Trial raised an exception. Traceback:\n{}".format(err_tb_str)
ray.tune.error.TuneError: Trial raised an exception. Traceback:
ray::ImplicitFunc.train_buffered() (pid=299, ip=172.28.0.2, repr=<types.ImplicitFunc object at 0x7f3b321f4610>)
File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 260, in run
self._entrypoint()
File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 329, in entrypoint
self._status_reporter.get_checkpoint())
File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 594, in _trainable_func
output = fn()
File "<ipython-input-2-edfb4cc8f06c>", line 82, in trainable
ValueError: invalid literal for int() with base 10: 'epochs'
Result for trainable_a7c07abc:
{}
Result for trainable_a7e5d122:
{}
== Status ==
Memory usage on this node: 1.2/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 2.0/2 CPUs, 0/0 GPUs, 0.0/7.31 GiB heap, 0.0/3.66 GiB objects
Result logdir: /root/ray_results/trainable_2021-10-05_19-15-51
Number of trials: 5/600 (2 ERROR, 1 PENDING, 2 RUNNING)
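| Trial name | status | loc | dropout_p_layer_0 | dropout_p_layer_1 | dropout_p_layer_2 | dropout_p_layer_3 | emb_drop | n_epochs | n_layers | num_units_layer_0 | num_units_layer_1 | num_units_layer_2 | num_units_layer_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| trainable_a7f956de | RUNNING | | | | | | 0.7 | 7 | 1 | | | | |
| trainable_aa3120b2 | RUNNING | | 0.45 | 0.7 | 0.35 | 0.85 | 0.4 | 9 | 5 | 1000 | 1200 | 900 | 1000 |
| trainable_aa531690 | PENDING | | 0.5 | 0.1 | | | 0.85 | 5 | 3 | 1000 | 1200 | | |
| trainable_a7c07abc | ERROR | | 0.5 | 0.95 | 0.05 | 0.7 | 0.65 | 9 | 5 | 800 | 800 | 1000 | 1000 |
| trainable_a7e5d122 | ERROR | | 0.5 | 0.1 | | | 0 | 4 | 3 | 900 | 1000 | | |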
Number of errored trials: 2
Here is the colab notebook link: Google Colab. I will respond as soon as you request access. Thanks for looking into it!
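Each worker traceback in the log excerpt ends with ValueError: invalid literal for int() with base 10: 'epochs'. That is the message int() produces when the key n_epochs is split on underscores and its last piece is converted, which is what the loop in trainable does for every key still left in config. A minimal sketch of that failure, and of one possible way around it, follows. `LAYER_KEY` and `split_config` are hypothetical names, and the sample config is hand-written to resemble one trial from the status table. Separately, procs does not appear to be defined in the code as posted, so it may raise its own error once this one is resolved.

```python
import re

# Reproduce the failure: the last piece of "n_epochs" is "epochs", not a number.
try:
    int("n_epochs".split("_")[-1])
    reproduced = False
except ValueError as err:
    reproduced = True
    error_message = str(err)

# Only keys of the form "<name>_layer_<index>" are treated as per-layer values.
LAYER_KEY = re.compile(r"^(num_units|dropout_p)_layer_(\d+)$")


def split_config(config):
    """Separate per-layer entries from everything else (illustrative helper)."""
    per_layer = {"num_units": {}, "dropout_p": {}}
    other = {}
    for key, value in config.items():
        match = LAYER_KEY.match(key)
        if match:
            per_layer[match.group(1)][int(match.group(2))] = value
        else:
            other[key] = value  # e.g. n_epochs, emb_drop, n_layers
    layers = [per_layer["num_units"][i] for i in sorted(per_layer["num_units"])]
    ps = [per_layer["dropout_p"][i] for i in sorted(per_layer["dropout_p"])]
    return layers, ps, other


sample = {
    "n_layers": 3, "emb_drop": 0.0, "n_epochs": 4,
    "num_units_layer_0": 900, "num_units_layer_1": 1000,
    "dropout_p_layer_0": 0.5, "dropout_p_layer_1": 0.1,
}
layers, ps, other = split_config(sample)
print(layers, ps, other)
```

Popping n_epochs from config before the loop, the same way emb_drop and n_layers are popped, would be a smaller change with the same effect.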
Using the updated code, and with a time budget of time_budget_s=600, I ran
analysis = tune.run(
    trainable,  # in case trainable has other arguments: tune.with_parameters(trainable, data=df),
    metric="f1",
    mode="max",
    search_alg=algo,
    num_samples=600,
    time_budget_s=600
)
It ended with a very similar error message: TuneError: Trials did not complete...
Is it caused by insufficient computational resources, e.g. too small a num_samples or time_budget_s? I tried increasing the values of those parameters, but nothing has worked so far. @Yard1 You can run the above code in Colab or use the notebook link here: Google Colab
"Trials did not complete" means that there was an exception in the trainable. As before, the exception message will be shown in the cell output. I guess we still missed something. I’ll try running it later.
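When the cell output runs to tens of thousands of lines, it can help to reduce it to the distinct exception messages before reading further. The sketch below does that with the standard library only, assuming the output has been saved as text; `EXC_LINE` and `summarize_exceptions` are hypothetical names, and the sample lines are copied from the log excerpt earlier in this thread.

```python
import re
from collections import Counter

# Matches a final traceback line such as "ValueError: message",
# optionally prefixed by Ray's "(pid=123)" worker tag.
EXC_LINE = re.compile(r"^(?:\(pid=\d+\)\s+)?([\w.]+(?:Error|Exception)): (.+)$")


def summarize_exceptions(log_text):
    """Count each distinct (exception name, message) pair found in the log text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = EXC_LINE.match(line.strip())
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


sample_log = """\
(pid=298) File "<ipython-input-2-edfb4cc8f06c>", line 82, in trainable
(pid=298) ValueError: invalid literal for int() with base 10: 'epochs'
(pid=299) ValueError: invalid literal for int() with base 10: 'epochs'
2021-10-05 19:15:55,752 ERROR trial_runner.py:773 -- Trial trainable_a7c07abc: Error processing event.
ray.tune.error.TuneError: Trial raised an exception. Traceback:
"""

summary = summarize_exceptions(sample_log)
for (name, message), count in summary.most_common():
    print(f"{count:3d} x {name}: {message}")
```

The TuneError entries are only wrappers; the exception raised inside the trainable (here the ValueError) is the one worth investigating first.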
@Yard1 Thank you for your update! I ran the code but still have the same error Trials did not complete.... Do you have the same situation when you run the colab notebook?