How to create a search space where one key's value depends on another key's value?

I want to create a search space for a neural network with n layers (chosen from [2, 3, 4, 5]), where each layer has a random number of units (chosen from [100, 200, 300]).

If I have 2 layers, then the units could be [100, 300], [300, 200], [200, 200], etc.

If I have 3 layers, then the units could be [100, 200, 100], [300, 100, 200], [200, 300, 300], etc.

The code could look like this:

def train_nn(config):
    ...

space = {
    "n_layers": tune.choice([2, 3, 4, 5]),
    "n_units": [tune.choice([100, 200, 300]) for i in range(space["n_layers"])],
}

analysis = tune.run(
    train_nn,
    config = space,
)

Here I try to create space['n_units'] from the randomly selected value of space['n_layers']. Obviously this raises an error:

TypeError                                 Traceback (most recent call last)
<ipython-input-13-e7959554a989> in <module>()
      1 # search space
      2 space = {"n_layers": tune.choice([2,3,4,5]), 
----> 3          "n_units": [tune.choice([100, 200,300]) for i in range(space['n_layers'])]
TypeError: 'Categorical' object cannot be interpreted as an integer

Any advice would be appreciated.

The implementation in Optuna is as follows; what is the corresponding implementation in Ray?

def objective(trial: optuna.Trial):
    num_layers = trial.suggest_int('n_layers', 1, 5)  # `num_layers` is 1, 2, 3, 4, or 5.
    layers, ps = [], []  # the number of units of each layer / the dropout ratio of each layer
    for i in range(num_layers - 1):  # `TabularModel` automatically adds the last layer.
        num_units = trial.suggest_categorical(f'num_units_layer_{i}', [800, 900, 1000, 1100, 1200])
        p = trial.suggest_discrete_uniform(f'dropout_p_layer_{i}', 0, 1, 0.05)
        layers.append(num_units)
        ps.append(p)

    emb_drop = trial.suggest_discrete_uniform('emb_drop', 0, 1, 0.05)
    learn = tabular_learner(data, layers=layers, ps=ps, emb_drop=emb_drop, y_range=y_range, metrics=exp_rmspe)

    learn.fit_one_cycle(5, 1e-3, wd=0.2)
    return learn.validate()[-1].item()  # Of course you can use the last record of `learn.recorder`.
    
study = optuna.create_study()
study.optimize(objective)
best_trial = study.best_trial

If `tune.choice([2,3,4,5])` behaved like `np.random.choice([2,3,4,5])`, the solution would be easy:

# first, we define the space dictionary
space = {"n_layers": tune.choice([2, 3, 4, 5])}

# then we add an additional item to the space dictionary
space["n_units"] = [tune.choice([100, 200, 300]) for i in range(space['n_layers'])]

But unfortunately, `tune.choice([2,3,4,5])` is not equivalent to `np.random.choice([2,3,4,5])`: the former is a `ray.tune.sample.Categorical` object while the latter is a `numpy.int64`. Therefore the above code raises an error:


TypeError: 'Categorical' object cannot be interpreted as an integer
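
To make the difference concrete, here is a quick check (illustrative, not part of my original code):

from ray import tune
import numpy as np

print(type(tune.choice([2, 3, 4, 5])))       # <class 'ray.tune.sample.Categorical'>
print(type(np.random.choice([2, 3, 4, 5])))  # <class 'numpy.int64'>

# range() requires a real integer, so range(tune.choice([2, 3, 4, 5]))
# raises TypeError: 'Categorical' object cannot be interpreted as an integer.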

@Paul Maybe you could try this:

https://docs.ray.io/en/latest/tune/api_docs/search_space.html#tune-custom-search
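
For example, a conditional space along the lines of that page could look roughly like this (a sketch using tune.sample_from, untested here; spec.config.n_layers resolves to the value already drawn for "n_layers" in the same trial):

from ray import tune
import numpy as np

space = {
    "n_layers": tune.choice([2, 3, 4, 5]),
    # resampled per trial, conditioned on the n_layers drawn above
    "n_units": tune.sample_from(
        lambda spec: [
            int(np.random.choice([100, 200, 300]))
            for _ in range(spec.config.n_layers)
        ]
    ),
}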


Hey @Paul, another way that you can do this is to use our Optuna integration, which supports define-by-run. You can use your already existing Optuna objective function, with the only difference being that you need to separate it out into a define function and a run (trainable) function. You would do it like this:


import optuna
from ray import tune
from ray.tune.suggest.optuna import OptunaSearch

def define_by_run_func(trial: optuna.Trial):
    num_layers = trial.suggest_int('n_layers', 1, 5)  # `num_layers` is 1, 2, 3, 4, or 5.
    for i in range(num_layers - 1):  # `TabularModel` automatically adds the last layer.
        num_units = trial.suggest_categorical(f'num_units_layer_{i}', [800, 900, 1000, 1100, 1200])
        p = trial.suggest_discrete_uniform(f'dropout_p_layer_{i}', 0, 1, 0.05)

    emb_drop = trial.suggest_discrete_uniform('emb_drop', 0, 1, 0.05)
    return  # intentionally returns None; the suggested params reach the trainable via `config`

def trainable(config, checkpoint_dir=None):
    emb_drop = config.pop("emb_drop")
    num_layers = config.pop("n_layers")
    layers, ps = [None]*num_layers, [None]*num_layers
    for k, v in config.items():
        index = int(k.split("_")[-1])
        if "num_units" in k:
            layers[index] = v
        elif "dropout" in k:
            ps[index] = v

    learn = tabular_learner(data, layers=layers, ps=ps, emb_drop=emb_drop, y_range=y_range, metrics=exp_rmspe)

    learn.fit_one_cycle(5, 1e-3, wd=0.2)
    tune.report(loss=learn.validate()[-1].item())

algo = OptunaSearch(
    space=define_by_run_func, metric="loss", mode="min")
analysis = tune.run(
    trainable,
    metric="loss",
    mode="min",
    search_alg=algo,
    num_samples=10,
)

Hope this helps! You can see a full runnable example here - optuna_define_by_run_example — Ray v1.6.0


Thanks @Yard1, that is helpful! What do you think about the performance/speed of the custom/conditional search space vs. the Optuna method?

There should be no speed difference between the two methods. Most search algorithms we have implemented in Tune (other than random search) don't support conditional search spaces through nested dictionaries, so by using Optuna define-by-run you can take advantage of Optuna's Bayesian optimization, which should give better results than a conditional search space with random search.

@Yard1 Thanks for the context!

I ran your code and ended up with a TuneError:

TuneError                                 Traceback (most recent call last)
<ipython-input-7-4f1d2fdff93e> in <module>()
     54     mode="min",
     55     search_alg=algo,
---> 56     num_samples=600
     57 )

/usr/local/lib/python3.7/dist-packages/ray/tune/tune.py in run(run_or_experiment, name, metric, mode, stop, time_budget_s, config, resources_per_trial, num_samples, local_dir, search_alg, scheduler, keep_checkpoints_num, checkpoint_score_attr, checkpoint_freq, checkpoint_at_end, verbose, progress_reporter, log_to_file, trial_name_creator, trial_dirname_creator, sync_config, export_formats, max_failures, fail_fast, restore, server_port, resume, queue_trials, reuse_actors, trial_executor, raise_on_failed_trial, callbacks, loggers, ray_auto_init, run_errored_only, global_checkpoint_period, with_server, upload_dir, sync_to_cloud, sync_to_driver, sync_on_checkpoint, _remote)
    553     if incomplete_trials:
    554         if raise_on_failed_trial and not state[signal.SIGINT]:
--> 555             raise TuneError("Trials did not complete", incomplete_trials)
    556         else:
    557             logger.error("Trials did not complete: %s", incomplete_trials)

TuneError: ('Trials did not complete', [trainable_813a0634, trainable_81e8b4ea, trainable_828a5872, trainable_82d22f94, trainable_837b1d5c, trainable_8441da00, trainable_84f2c32e, trainable_85d1a29c, trainable_8672bd8a, trainable_875fbf18, trainable_8800110c

I noticed that there is no return in the define_by_run_func(trial: optuna.Trial) function. Is that normal, or should it be something like this:

def define_by_run_func(trial: optuna.Trial):
    n_layers = trial.suggest_int('n_layers', 1, 5)  # `n_layers` is 1, 2, 3, 4, or 5.
    layers, ps = [], []
    for i in range(n_layers - 1):  # `TabularModel` automatically adds the last layer.
        num_units = trial.suggest_categorical(f'num_units_layer_{i}', [800, 900, 1000, 1100, 1200])
        p = trial.suggest_discrete_uniform(f'dropout_p_layer_{i}', 0, 1, 0.05)
        layers.append(num_units)
        ps.append(p)
    emb_drop = trial.suggest_discrete_uniform('emb_drop', 0, 1, 0.05)
    para_dic = {'n_layers':n_layers, 'layers':layers, 'ps':ps, 'emb_drop': emb_drop}
    return para_dic

The lack of return is correct. There should be an actual, in-trainable stack trace earlier in the output; can you post that?
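
To illustrate the two valid shapes, here is a minimal sketch (batch_size here is just a made-up constant for illustration):

import optuna

def define_by_run_func(trial: optuna.Trial):
    # Parameters are registered on the trial by the suggest_* calls and
    # reach the trainable through `config`; they are not returned.
    trial.suggest_int("n_layers", 1, 5)
    trial.suggest_discrete_uniform("emb_drop", 0, 1, 0.05)
    # Either return None, or return a dict of constants to merge into config:
    return {"batch_size": 64}  # hypothetical constant, for illustration only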

I notice that the code I posted doesn't pass data to the trainable; that could be the cause if you haven't changed anything.
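
The usual way to pass extra objects like that is tune.with_parameters, roughly:

analysis = tune.run(
    tune.with_parameters(trainable, data=df),  # forwards `data` to trainable(config, data, ...)
    metric="loss",
    mode="min",
    search_alg=algo,
    num_samples=10,
)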

Yes, I provided the data. Here is the full set of code:

from ray import tune
import optuna
from ray.tune.suggest.optuna import OptunaSearch
from fastai.tabular import * 

# define path
path = untar_data(URLs.ADULT_SAMPLE)

# load data
df = pd.read_csv(path/'adult.csv')

# simple split data into train & valid
valid_idx = range(len(df)-2000, len(df))

# define local variables
dep_var = 'salary'
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']

def define_by_run_func(trial: optuna.Trial):
    """Define-by-run function to create the search space.

    Ensure no actual computation takes place here. That should go into
    the trainable passed to ``tune.run`` (in this example, that's
    ``easy_objective``).

    For more information, see https://optuna.readthedocs.io/en/stable\
/tutorial/10_key_features/002_configurations.html

    This function should either return None or a dict with constant values.
    """
    n_layers = trial.suggest_int('n_layers', 1, 5)  # `n_layers` is 1, 2, 3, 4, or 5.
    #layers, ps = [], []
    for i in range(n_layers - 1):  # `TabularModel` automatically adds the last layer.
        num_units = trial.suggest_categorical(f'num_units_layer_{i}', [800, 900, 1000, 1100, 1200])
        p = trial.suggest_discrete_uniform(f'dropout_p_layer_{i}', 0, 1, 0.05)
        #layers.append(num_units)
        #ps.append(p)
    emb_drop = trial.suggest_discrete_uniform('emb_drop', 0, 1, 0.05)
    n_epochs = trial.suggest_categorical('n_epochs', [1,2,4,5,7,9,10])
    #para_dic = {'n_layers':n_layers, 'layers':layers, 'ps':ps, emb_drop:'emb_drop'}
    return
 

def trainable(config, checkpoint_dir = None):
    emb_drop = config.pop("emb_drop")
    num_layers = config.pop("n_layers")
    layers, ps = [None]*num_layers, [None]*num_layers
    for k, v in config.items():
        index = int(k.split("_")[-1])
        if "num_units" in k:
            layers[index] = v
        elif "dropout" in k:
            ps[index] = v
    #metrics
    f1=FBeta()
    precision = Precision()
    recall = Recall()

    # train classifier
    learn = tabular_learner(data, layers=layers, ps=ps, emb_drop=emb_drop, emb_szs={'native-country': 10}, metrics=[accuracy, precision, recall, f1])
    
    # auto find learning rate
    try:
        lr = find_appropriate_lr(model=learn, plot=True)
        print(f'clf uses estimated lr={lr}')
    except:
        lr = 1e-2
        print(f'clf uses pre-defined lr={lr}')
    
    # train n_epoch
    n_epochs = config['n_epochs']
    learn.fit_one_cycle(n_epochs, moms=(lr*0.01,lr))
    
    # build validation performance metrics
    valid_metrics = dict(zip(['accuracy', 'precision', 'recall', 'f1'], [x.item() for x in learn.recorder.metrics[-1]])) # -1 means selecting the last epoch
    
    # send metrics to tune
    tune.report(**valid_metrics)


algo = OptunaSearch(
    space=define_by_run_func, 
    metric="f1", 
    mode="max")

analysis = tune.run(
    trainable,
    metric="f1",
    mode="max",
    search_alg=algo,
    num_samples=600
)

Thanks! Do you have that stack trace? The actual exception that happened in the trainable should be printed out somewhere before the TuneError.

Looking at the code I can see a few minor errors - for example, data is not passed into the trainable. I don’t think you are using valid_idx either.

This code passes the data to the trainable (though it doesn't use the valid_idx):

from ray import tune
import optuna
from ray.tune.suggest.optuna import OptunaSearch
from fastai.tabular import * 

# define path
path = untar_data(URLs.ADULT_SAMPLE)

# load data
df = pd.read_csv(path/'adult.csv')

# simple split data into train & valid
valid_idx = range(len(df)-2000, len(df))

# define local variables
dep_var = 'salary'
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']

def define_by_run_func(trial: optuna.Trial):
    """Define-by-run function to create the search space.

    Ensure no actual computation takes place here. That should go into
    the trainable passed to ``tune.run`` (in this example, that's
    ``easy_objective``).

    For more information, see https://optuna.readthedocs.io/en/stable\
/tutorial/10_key_features/002_configurations.html

    This function should either return None or a dict with constant values.
    """
    n_layers = trial.suggest_int('n_layers', 1, 5)  # `n_layers` is 1, 2, 3, 4, or 5.
    #layers, ps = [], []
    for i in range(n_layers - 1):  # `TabularModel` automatically adds the last layer.
        num_units = trial.suggest_categorical(f'num_units_layer_{i}', [800, 900, 1000, 1100, 1200])
        p = trial.suggest_discrete_uniform(f'dropout_p_layer_{i}', 0, 1, 0.05)
        #layers.append(num_units)
        #ps.append(p)
    emb_drop = trial.suggest_discrete_uniform('emb_drop', 0, 1, 0.05)
    n_epochs = trial.suggest_categorical('n_epochs', [1,2,4,5,7,9,10])
    #para_dic = {'n_layers':n_layers, 'layers':layers, 'ps':ps, emb_drop:'emb_drop'}
    return
 

def trainable(config, data, checkpoint_dir = None):
    emb_drop = config.pop("emb_drop")
    num_layers = config.pop("n_layers")
    layers, ps = [None]*num_layers, [None]*num_layers
    for k, v in config.items():
        index = int(k.split("_")[-1])
        if "num_units" in k:
            layers[index] = v
        elif "dropout" in k:
            ps[index] = v
    #metrics
    f1=FBeta()
    precision = Precision()
    recall = Recall()

    # train classifier
    learn = tabular_learner(data, layers=layers, ps=ps, emb_drop=emb_drop, emb_szs={'native-country': 10}, metrics=[accuracy, precision, recall, f1])
    
    # auto find learning rate
    try:
        lr = find_appropriate_lr(model=learn, plot=True)
        print(f'clf uses estimated lr={lr}')
    except:
        lr = 1e-2
        print(f'clf uses pre-defined lr={lr}')
    
    # train n_epoch
    n_epochs = config['n_epochs']
    learn.fit_one_cycle(n_epochs, moms=(lr*0.01,lr))
    
    # build validation performance metrics
    valid_metrics = dict(zip(['accuracy', 'precision', 'recall', 'f1'], [x.item() for x in learn.recorder.metrics[-1]])) # -1 means selecting the last epoch
    
    # send metrics to tune
    tune.report(**valid_metrics)


algo = OptunaSearch(
    space=define_by_run_func, 
    metric="f1", 
    mode="max")

analysis = tune.run(
    tune.with_parameters(trainable, data=df),
    metric="f1",
    mode="max",
    search_alg=algo,
    num_samples=600
)

Hi @Yard1, great catch! Actually I mis-copied one line of code about data:

# prep data for tabular_learner()
data = TabularDataBunch.from_df(path, df, dep_var=dep_var, valid_idx=valid_idx, procs=procs, cat_names=cat_names)
Here is the complete set of code:

from ray import tune
import optuna
from ray.tune.suggest.optuna import OptunaSearch
from fastai.tabular import * 

# define path
path = untar_data(URLs.ADULT_SAMPLE)

# load data
df = pd.read_csv(path/'adult.csv')

# simple split data into train & valid
valid_idx = range(len(df)-2000, len(df))

# define local variables
dep_var = 'salary'
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']

#helper functions

def find_appropriate_lr(model, lr_diff=15, loss_threshold=.05, adjust_value=1, plot=False):
    """automatically find the appropriate learning rate
    Args:
        model (learner)
        lr_diff(int, default 15)
        loss_threshold(float, default .05) 
        adjust_value(float, default = 1), 
        plot (bool default= False)
    Return:
        lr (float): optimal learning rate.
    Ref: https://forums.fast.ai/t/automated-learning-rate-suggester/44199 """
    #Run the Learning Rate Finder
    model.lr_find()
    
    #Get loss values and their corresponding gradients, and get lr values
    losses = np.array(model.recorder.losses)
    assert(lr_diff < len(losses))
    loss_grad = np.gradient(losses)
    lrs = model.recorder.lrs
    
    #Search for index in gradients where loss is lowest before the loss spike
    #Initialize right and left idx using the lr_diff as a spacing unit
    #Set the local min lr as -1 to signify if threshold is too low
    r_idx = -1
    l_idx = r_idx - lr_diff
    while (l_idx >= -len(losses)) and (abs(loss_grad[r_idx] - loss_grad[l_idx]) > loss_threshold):
        local_min_lr = lrs[l_idx]
        r_idx -= 1
        l_idx -= 1

    lr_to_use = local_min_lr * adjust_value
    
    if plot:
        # plots the gradients of the losses in respect to the learning rate change
        plt.plot(loss_grad)
        plt.plot(len(losses)+l_idx, loss_grad[l_idx],markersize=10,marker='o',color='red')
        plt.ylabel("Loss")
        plt.xlabel("Index of LRs")
        plt.show()

        plt.plot(np.log10(lrs), losses)
        plt.ylabel("Loss")
        plt.xlabel("Log 10 Transform of Learning Rate")
        loss_coord = np.interp(np.log10(lr_to_use), np.log10(lrs), losses)
        plt.plot(np.log10(lr_to_use), loss_coord, markersize=10,marker='o',color='red')
        plt.show()
        
    return lr_to_use

def define_by_run_func(trial: optuna.Trial):
    """Define-by-run function to create the search space.

    Ensure no actual computation takes place here. That should go into
    the trainable passed to ``tune.run`` (in this example, that's
    ``easy_objective``).

    For more information, see https://optuna.readthedocs.io/en/stable\
/tutorial/10_key_features/002_configurations.html

    This function should either return None or a dict with constant values.
    """
    n_layers = trial.suggest_int('n_layers', 1, 5)  # `n_layers` is 1, 2, 3, 4, or 5.
    #layers, ps = [], []
    for i in range(n_layers - 1):  # `TabularModel` automatically adds the last layer.
        num_units = trial.suggest_categorical(f'num_units_layer_{i}', [800, 900, 1000, 1100, 1200])
        p = trial.suggest_discrete_uniform(f'dropout_p_layer_{i}', 0, 1, 0.05)
        #layers.append(num_units)
        #ps.append(p)
    emb_drop = trial.suggest_discrete_uniform('emb_drop', 0, 1, 0.05)
    n_epochs = trial.suggest_categorical('n_epochs', [1,2,4,5,7,9,10])
    #para_dic = {'n_layers':n_layers, 'layers':layers, 'ps':ps, emb_drop:'emb_drop'}
    return
 

def trainable(config, checkpoint_dir = None):
    emb_drop = config.pop("emb_drop")
    num_layers = config.pop("n_layers")
    layers, ps = [None]*num_layers, [None]*num_layers
    for k, v in config.items():
        index = int(k.split("_")[-1])
        if "num_units" in k:
            layers[index] = v
        elif "dropout" in k:
            ps[index] = v
    #metrics
    f1=FBeta()
    precision = Precision()
    recall = Recall()

    # prep data for tabular_learner()
    data = TabularDataBunch.from_df(path, df, dep_var=dep_var, valid_idx=valid_idx, procs=procs, cat_names=cat_names)

    # train classifier
    learn = tabular_learner(data, layers=layers, ps=ps, emb_drop=emb_drop, emb_szs={'native-country': 10}, metrics=[accuracy, precision, recall, f1])
    
    # auto find learning rate
    try:
        lr = find_appropriate_lr(model=learn, plot=True)
        print(f'clf uses estimated lr={lr}')
    except:
        lr = 1e-2
        print(f'clf uses pre-defined lr={lr}')
    
    # train n_epoch
    n_epochs = config['n_epochs']
    learn.fit_one_cycle(n_epochs, moms=(lr*0.01,lr))
    
    # build validation performance metrics
    valid_metrics = dict(zip(['accuracy', 'precision', 'recall', 'f1'], [x.item() for x in learn.recorder.metrics[-1]])) # -1 means selecting the last epoch
    
    # send metrics to tune
    tune.report(**valid_metrics)

# hyperparameters tuning by Optuna

algo = OptunaSearch(
    space=define_by_run_func, 
    metric="f1", 
    mode="max")

analysis = tune.run(
    trainable,      #in case trainable has other arguments: tune.with_parameters(trainable, data=df),
    metric="f1",
    mode="max",
    search_alg=algo,
    num_samples=600, 
)

The error message from running the above code was:

---------------------------------------------------------------------------
TuneError                                 Traceback (most recent call last)
<ipython-input-3-33850473a538> in <module>()
----> 1 get_ipython().run_cell_magic('time', '', '# hyperparameters tuning by Optuna\n\nalgo = OptunaSearch(\n    space=define_by_run_func, \n    metric="f1", \n    mode="max")\n\nanalysis = tune.run(\n    trainable,      #in case trainable has other arguments: tune.with_parameters(trainable, data=df),\n    metric="f1",\n    mode="max",\n    search_alg=algo,\n    num_samples=600, \n    \n)')

3 frames
/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py in run_cell_magic(self, magic_name, line, cell)
   2115             magic_arg_s = self.var_expand(line, stack_depth)
   2116             with self.builtin_trap:
-> 2117                 result = fn(magic_arg_s, cell)
   2118             return result
   2119 

<decorator-gen-53> in time(self, line, cell, local_ns)

/usr/local/lib/python3.7/dist-packages/IPython/core/magic.py in <lambda>(f, *a, **k)
    186     # but it's overkill for just that one bit of state.
    187     def magic_deco(arg):
--> 188         call = lambda f, *a, **k: f(*a, **k)
    189 
    190         if callable(arg):

/usr/local/lib/python3.7/dist-packages/IPython/core/magics/execution.py in time(self, line, cell, local_ns)
   1191         else:
   1192             st = clock2()
-> 1193             exec(code, glob, local_ns)
   1194             end = clock2()
   1195             out = None

<timed exec> in <module>()

/usr/local/lib/python3.7/dist-packages/ray/tune/tune.py in run(run_or_experiment, name, metric, mode, stop, time_budget_s, config, resources_per_trial, num_samples, local_dir, search_alg, scheduler, keep_checkpoints_num, checkpoint_score_attr, checkpoint_freq, checkpoint_at_end, verbose, progress_reporter, log_to_file, trial_name_creator, trial_dirname_creator, sync_config, export_formats, max_failures, fail_fast, restore, server_port, resume, queue_trials, reuse_actors, trial_executor, raise_on_failed_trial, callbacks, loggers, ray_auto_init, run_errored_only, global_checkpoint_period, with_server, upload_dir, sync_to_cloud, sync_to_driver, sync_on_checkpoint, _remote)
    553     if incomplete_trials:
    554         if raise_on_failed_trial and not state[signal.SIGINT]:
--> 555             raise TuneError("Trials did not complete", incomplete_trials)
    556         else:
    557             logger.error("Trials did not complete: %s", incomplete_trials)

TuneError: ('Trials did not complete', [trainable_a7c07abc, trainable_a7e5d122, trainable_a7f956de, trainable_aa3120b2,
...

What puzzles me is that I keep encountering TuneError: ('Trials did not complete', [trainable_a7c07abc, trainable_a7e5d122, trainable_a7f956de, .... What causes it and how can I fix it?

There should be a stack trace from inside the trainable that would tell us the exact reason for the trials not completing. That stack trace would be printed out from the cell that was running tune.run. Is it possible for you to share the output from that cell?

The execution of that cell produced tens of thousands of lines of output, which look like this (here is a small subset):

(pid=298) 2021-10-05 19:15:55,548	ERROR function_runner.py:266 -- Runner Thread raised error.
(pid=298) Traceback (most recent call last):
(pid=298)   File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 260, in run
(pid=298)     self._entrypoint()
(pid=298)   File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 329, in entrypoint
(pid=298)     self._status_reporter.get_checkpoint())
(pid=298)   File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 594, in _trainable_func
(pid=298)     output = fn()
(pid=298)   File "<ipython-input-2-edfb4cc8f06c>", line 82, in trainable
(pid=298) ValueError: invalid literal for int() with base 10: 'epochs'
(pid=298) Exception in thread Thread-2:
(pid=298) Traceback (most recent call last):
(pid=298)   File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
(pid=298)     self.run()
(pid=298)   File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 279, in run
(pid=298)     raise e
(pid=298)   File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 260, in run
(pid=298)     self._entrypoint()
(pid=298)   File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 329, in entrypoint
(pid=298)     self._status_reporter.get_checkpoint())
(pid=298)   File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 594, in _trainable_func
(pid=298)     output = fn()
(pid=298)   File "<ipython-input-2-edfb4cc8f06c>", line 82, in trainable
(pid=298) ValueError: invalid literal for int() with base 10: 'epochs'
(pid=298) 
2021-10-05 19:15:55,752	ERROR trial_runner.py:773 -- Trial trainable_a7c07abc: Error processing event.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/trial_runner.py", line 739, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/ray_trial_executor.py", line 746, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/usr/local/lib/python3.7/dist-packages/ray/_private/client_mode_hook.py", line 82, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/ray/worker.py", line 1621, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray::ImplicitFunc.train_buffered() (pid=298, ip=172.28.0.2, repr=<types.ImplicitFunc object at 0x7f9e7c03bf50>)
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/trainable.py", line 178, in train_buffered
    result = self.train()
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/trainable.py", line 237, in train
    result = self.step()
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 379, in step
    self._report_thread_runner_error(block=True)
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 527, in _report_thread_runner_error
    ("Trial raised an exception. Traceback:\n{}".format(err_tb_str)
ray.tune.error.TuneError: Trial raised an exception. Traceback:
ray::ImplicitFunc.train_buffered() (pid=298, ip=172.28.0.2, repr=<types.ImplicitFunc object at 0x7f9e7c03bf50>)
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 260, in run
    self._entrypoint()
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 329, in entrypoint
    self._status_reporter.get_checkpoint())
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 594, in _trainable_func
    output = fn()
  File "<ipython-input-2-edfb4cc8f06c>", line 82, in trainable
ValueError: invalid literal for int() with base 10: 'epochs'
(pid=299) 2021-10-05 19:15:55,746	ERROR function_runner.py:266 -- Runner Thread raised error.
(pid=299) Traceback (most recent call last):
(pid=299)   File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 260, in run
(pid=299)     self._entrypoint()
(pid=299)   File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 329, in entrypoint
(pid=299)     self._status_reporter.get_checkpoint())
(pid=299)   File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 594, in _trainable_func
(pid=299)     output = fn()
(pid=299)   File "<ipython-input-2-edfb4cc8f06c>", line 82, in trainable
(pid=299) ValueError: invalid literal for int() with base 10: 'epochs'
(pid=299) Exception in thread Thread-2:
(pid=299) Traceback (most recent call last):
(pid=299)   File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
(pid=299)     self.run()
(pid=299)   File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 279, in run
(pid=299)     raise e
(pid=299)   File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 260, in run
(pid=299)     self._entrypoint()
(pid=299)   File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 329, in entrypoint
(pid=299)     self._status_reporter.get_checkpoint())
(pid=299)   File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 594, in _trainable_func
(pid=299)     output = fn()
(pid=299)   File "<ipython-input-2-edfb4cc8f06c>", line 82, in trainable
(pid=299) ValueError: invalid literal for int() with base 10: 'epochs'
(pid=299) 
2021-10-05 19:15:55,949	ERROR trial_runner.py:773 -- Trial trainable_a7e5d122: Error processing event.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/trial_runner.py", line 739, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/ray_trial_executor.py", line 746, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/usr/local/lib/python3.7/dist-packages/ray/_private/client_mode_hook.py", line 82, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/ray/worker.py", line 1621, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray::ImplicitFunc.train_buffered() (pid=299, ip=172.28.0.2, repr=<types.ImplicitFunc object at 0x7f3b321f4610>)
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/trainable.py", line 178, in train_buffered
    result = self.train()
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/trainable.py", line 237, in train
    result = self.step()
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 379, in step
    self._report_thread_runner_error(block=True)
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 527, in _report_thread_runner_error
    ("Trial raised an exception. Traceback:\n{}".format(err_tb_str)
ray.tune.error.TuneError: Trial raised an exception. Traceback:
ray::ImplicitFunc.train_buffered() (pid=299, ip=172.28.0.2, repr=<types.ImplicitFunc object at 0x7f3b321f4610>)
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 260, in run
    self._entrypoint()
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 329, in entrypoint
    self._status_reporter.get_checkpoint())
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/function_runner.py", line 594, in _trainable_func
    output = fn()
  File "<ipython-input-2-edfb4cc8f06c>", line 82, in trainable
ValueError: invalid literal for int() with base 10: 'epochs'
Result for trainable_a7c07abc:
  {}
  
Result for trainable_a7e5d122:
  {}
  
== Status ==
Memory usage on this node: 1.2/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 2.0/2 CPUs, 0/0 GPUs, 0.0/7.31 GiB heap, 0.0/3.66 GiB objects
Result logdir: /root/ray_results/trainable_2021-10-05_19-15-51
Number of trials: 5/600 (2 ERROR, 1 PENDING, 2 RUNNING)
Trial name          status   loc  dropout_p_layer_0  dropout_p_layer_1  dropout_p_layer_2  dropout_p_layer_3  emb_drop  n_epochs  n_layers  num_units_layer_0  num_units_layer_1  num_units_layer_2  num_units_layer_3
trainable_a7f956de  RUNNING                                                                                   0.7       7         1
trainable_aa3120b2  RUNNING       0.45               0.7                0.35               0.85               0.4       9         5         1000               1200               900                1000
trainable_aa531690  PENDING       0.5                0.1                                                      0.85      5         3         1000               1200
trainable_a7c07abc  ERROR         0.5                0.95               0.05               0.7                0.65      9         5         800                800                1000               1000
trainable_a7e5d122  ERROR         0.5                0.1                                                      0         4         3         900                1000

Number of errored trials: 2

Here is the colab notebook link: Google Colab. I will respond right after you request access. Thanks for looking into it!

Hey, I see the issue - I forgot about the epochs param. The loop in the trainable calls int(k.split("_")[-1]) on every remaining config key, including n_epochs, and int('epochs') is not a valid integer:

(pid=298)   File "<ipython-input-2-edfb4cc8f06c>", line 82, in trainable
(pid=298) ValueError: invalid literal for int() with base 10: 'epochs'

Can you try this version of the trainable?

def trainable(config, checkpoint_dir = None):
    emb_drop = config.pop("emb_drop")
    num_layers = config.pop("n_layers")
    n_epochs = config.pop('n_epochs')
    layers, ps = [None]*num_layers, [None]*num_layers
    for k, v in config.items():
        if "num_units" in k:
            index = int(k.split("_")[-1])
            layers[index] = v
        elif "dropout" in k:
            index = int(k.split("_")[-1])
            ps[index] = v
    #metrics
    f1=FBeta()
    precision = Precision()
    recall = Recall()

    # prep data for tabular_learner()
    data = TabularDataBunch.from_df(path, df, dep_var=dep_var, valid_idx=valid_idx, procs=procs, cat_names=cat_names)

    # train classifier
    learn = tabular_learner(data, layers=layers, ps=ps, emb_drop=emb_drop, emb_szs={'native-country': 10}, metrics=[accuracy, precision, recall, f1])
    
    # auto find learning rate
    try:
        lr = find_appropriate_lr(model=learn, plot=True)
        print(f'clf uses estimated lr={lr}')
    except:
        lr = 1e-2
        print(f'clf uses pre-defined lr={lr}')
    
    # train n_epoch
    learn.fit_one_cycle(n_epochs, moms=(lr*0.01,lr))
    
    # build validation performance metrics
    valid_metrics = dict(zip(['accuracy', 'precision', 'recall', 'f1'], [x.item() for x in learn.recorder.metrics[-1]])) # -1 means selecting the last epoch
    
    # send metrics to tune
    tune.report(**valid_metrics)

Sure! I am running it now. Will give you an update soon. Thank you!

Using the updated code, plus a time budget of time_budget_s=600, I ran:

analysis = tune.run(
    trainable,      #in case trainable has other arguments: tune.with_parameters(trainable, data=df),
    metric="f1",
    mode="max",
    search_alg=algo,
    num_samples=600, 
    time_budget_s=600
)

it ended up with a very similar error message: TuneError: Trials did not complete...

Is it caused by insufficient computational resources, e.g. not enough num_samples or time_budget_s? I tried increasing the values of those parameters, but nothing has worked so far. @Yard1 You can run the above code in Colab or use the notebook link here: Google Colab

"Trials did not complete" means that there was an exception in the trainable. As before, the exception message will be shown in the cell output. I guess we still missed something. I'll try running it later.
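
One quick way to surface the first real exception (a sketch, using the fail_fast option that appears in the tune.run signature in your traceback) is to stop the run on the first failed trial:

analysis = tune.run(
    trainable,
    metric="f1",
    mode="max",
    search_alg=algo,
    num_samples=600,
    fail_fast=True,  # abort on the first trial error so its stack trace
                     # sits right at the end of the cell output
)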

Hey @Paul, here is the fixed trainable:

def trainable(config, checkpoint_dir = None):
    config = config.copy()
    emb_drop = config.pop("emb_drop")
    num_layers = config.pop("n_layers")
    n_epochs = config.pop("n_epochs")
    layers, ps = [None]*(num_layers-1), [None]*(num_layers-1)  # only n_layers-1 sizes/dropouts are suggested; `TabularModel` adds the last layer
    for k, v in config.items():
        index = int(k.split("_")[-1])
        if "num_units" in k:
            layers[index] = v
        elif "dropout" in k:
            ps[index] = v
    #metrics
    f1=FBeta()
    precision = Precision()
    recall = Recall()

    # prep data for tabular_learner()
    procs = [FillMissing, Categorify, Normalize]
    data = TabularDataBunch.from_df(path, df, dep_var=dep_var, valid_idx=valid_idx, procs=procs, cat_names=cat_names)

    # train classifier
    learn = tabular_learner(data, layers=layers, ps=ps, emb_drop=emb_drop, emb_szs={'native-country': 10}, metrics=[accuracy, precision, recall, f1])
    
    # auto find learning rate
    try:
        lr = find_appropriate_lr(model=learn, plot=True)
        print(f'clf uses estimated lr={lr}')
    except:
        lr = 1e-2
        print(f'clf uses pre-defined lr={lr}')
    
    # train n_epoch
    learn.fit_one_cycle(n_epochs, moms=(lr*0.01,lr))
    
    # build validation performance metrics
    valid_metrics = dict(zip(['accuracy', 'precision', 'recall', 'f1'], [x.item() for x in learn.recorder.metrics[-1]])) # -1 means selecting the last epoch
    
    # send metrics to tune
    tune.report(**valid_metrics)

@Yard1 Thank you for your update! I ran the code but still get the same error, Trials did not complete.... Do you get the same result when you run the colab notebook?