Hello!
I am posting here regarding an error with Ray Tune when trying to perform hyperparameter tuning for a deep learning model. The hyperparameters include the batch size, the learning rate, and a couple of sizes related to the model’s layers.
I use Ray’s config dict to instantiate my model with the hyperparameter values of the current Ray trial. I cannot complete my hyperparameter tuning due to the following error:
Failure # 1 (occurred at 2023-06-30_13-13-13)
ray::ImplicitFunc.train() (pid=12048, ip=127.0.0.1, actor_id=854273743e67337fff42291e01000000, repr=train_flavoursynth)
File "python\ray\_raylet.pyx", line 1434, in ray._raylet.execute_task
File "python\ray\_raylet.pyx", line 1438, in ray._raylet.execute_task
File "python\ray\_raylet.pyx", line 1378, in ray._raylet.execute_task.function_executor
File "C:\Dev\ae\lib\site-packages\ray\_private\function_manager.py", line 724, in actor_method_executor
return method(__ray_actor, *args, **kwargs)
File "C:\Dev\ae\lib\site-packages\ray\util\tracing\tracing_helper.py", line 464, in _resume_span
return method(self, *_args, **_kwargs)
File "C:\Dev\ae\lib\site-packages\ray\tune\trainable\trainable.py", line 389, in train
raise skipped from exception_cause(skipped)
File "C:\Dev\ae\lib\site-packages\ray\tune\trainable\function_trainable.py", line 336, in entrypoint
return self._trainable_func(
File "C:\Dev\ae\lib\site-packages\ray\util\tracing\tracing_helper.py", line 464, in _resume_span
return method(self, *_args, **_kwargs)
File "C:\Dev\ae\lib\site-packages\ray\tune\trainable\function_trainable.py", line 653, in _trainable_func
output = fn()
…
File "C:\Dev\ae\lib\site-packages\torch\nn\modules\linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (4x16 and 10x128)
However, when I run my training script directly with the same hyperparameter settings, I do not experience any issues. Does anybody have an idea of what might be going on, please?
I am attaching part of the script at the end of this post.
Thank you,
Mary
def prepare_data_and_model(config, opt_path):
“”"Function to create dataloaders and model given the current ray trial settings.
Args:
config (dict): dictionary populated automatically by Ray Tune and corresponding to the hyperparameters selected for the trial from the search space
opt_path (string): full path to the experiment settings json fileReturns: model (FlavourSynth model): FlavourSynth model class object train_dataloader (torch Dataset): training set dataloader valid_dataloader (torch Dataset): validation set dataloder """ # experiment config opt = d.load_json(opt_path) # tune latent size opt['model']['ae_net']['params']['latent_size'] = config['latent_size'] opt['model']['disc']['input_size'] = config['latent_size'] # tune classifier mid size opt['model']['ae_net']['params']['classifier_mid_size'] = config['classifier_mid_size'] # tune discriminator mid size opt['model']['disc']['mid_size'] = config['disc_mid_size'] # tune learning rate opt['train']['lr'] = config['lr'] #tune batch size opt['train']['batch_size'] = config['batch_size'] # Set manual seed for reproducibility torch.manual_seed(opt['train']['seed']) np.random.seed(opt['train']['seed']) # Training set trainset = database(opt['data'], is_train=True) train_dataloader = torch.utils.data.DataLoader( trainset, batch_size=config['batch_size'], shuffle=True, num_workers=0) # Validation set validset = database(opt['data'], is_train=False) valid_dataloader = torch.utils.data.DataLoader( validset, batch_size=1, shuffle=False, num_workers=0) model = Model(opt, is_train=True) return model, train_dataloader, valid_dataloader
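In case it helps with suggestions: a quick check I could add just before the model is built, to confirm that the trial values actually reach opt, would look roughly like this (a sketch only, not part of the attached script):

def log_trial_settings(config, opt):
    """Print the hyperparameters Ray Tune passed in and the values copied into opt."""
    print("Tune config for this trial:", config)
    print("latent_size in opt:", opt['model']['ae_net']['params']['latent_size'])
    print("disc input_size in opt:", opt['model']['disc']['input_size'])
    print("classifier_mid_size in opt:", opt['model']['ae_net']['params']['classifier_mid_size'])
    print("disc mid_size in opt:", opt['model']['disc']['mid_size'])

I would call this right before model = Model(opt, is_train=True) and compare the printout between the Ray Tune run and the stand-alone run.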
def main():
    """Main function to perform hyperparameter tuning using Ray Tune."""
    search_space = {
        "lr": tune.grid_search([5e-5, 5e-4, 1e-4, 5e-3]),
        "batch_size": tune.grid_search([4, 8, 10, 12]),
        "latent_size": tune.grid_search([10, 16, 32, 64]),
        "classifier_mid_size": tune.grid_search([64, 128, 256]),
        "disc_mid_size": tune.grid_search([32, 64, 128, 256]),
    }

    tuner = tune.Tuner(
        tune.with_resources(train_model, {"gpu": 1}),
        tune_config=tune.TuneConfig(
            num_samples=1,  # number of times to sample from the hyperparameter space
            max_concurrent_trials=1,  # specify max concurrent runs
            scheduler=ASHAScheduler(metric="recon", mode="min"),
            chdir_to_trial_dir=False,  # handling relative paths
        ),
        param_space=search_space,
    )
    results = tuner.fit()

    # Obtain a trial dataframe from each trial of this tuning run.
    dfs = {result.log_dir: result.metrics_dataframe for result in results}
    with open('tune_trial.pickle', 'wb') as file:
        pickle.dump(dfs, file)
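For reference, this is roughly how I re-run a single combination from the grid outside Ray to compare against the Tune behaviour (a sketch only; "experiment.json" stands in for my actual settings file and the specific values are just an example):

# Stand-alone check: build the same config dict by hand and call
# prepare_data_and_model directly, bypassing Ray Tune entirely.
debug_config = {
    "lr": 5e-5,
    "batch_size": 4,
    "latent_size": 16,
    "classifier_mid_size": 128,
    "disc_mid_size": 32,
}
model, train_dl, valid_dl = prepare_data_and_model(debug_config, opt_path="experiment.json")  # placeholder path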