RuntimeError: mat1 and mat2 shapes cannot be multiplied (4x16 and 10x128)

Hello!

I am posting here regarding an error with Ray Tune when trying to perform hyperparameter tuning for a deep learning model. The hyperparameters include batch size, learning rate, and a couple of sizes related to the model's layers.

I use Ray's config dict to instantiate my model with the hyperparameter values of the current Ray trial. I cannot complete my hyperparameter tuning due to the following error:

Failure # 1 (occurred at 2023-06-30_13-13-13)
ray::ImplicitFunc.train() (pid=12048, ip=127.0.0.1, actor_id=854273743e67337fff42291e01000000, repr=train_flavoursynth)
  File "python\ray\_raylet.pyx", line 1434, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 1438, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 1378, in ray._raylet.execute_task.function_executor
  File "C:\Dev\ae\lib\site-packages\ray\_private\function_manager.py", line 724, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "C:\Dev\ae\lib\site-packages\ray\util\tracing\tracing_helper.py", line 464, in _resume_span
    return method(self, *_args, **_kwargs)
  File "C:\Dev\ae\lib\site-packages\ray\tune\trainable\trainable.py", line 389, in train
    raise skipped from exception_cause(skipped)
  File "C:\Dev\ae\lib\site-packages\ray\tune\trainable\function_trainable.py", line 336, in entrypoint
    return self._trainable_func(
  File "C:\Dev\ae\lib\site-packages\ray\util\tracing\tracing_helper.py", line 464, in _resume_span
    return method(self, *_args, **_kwargs)
  File "C:\Dev\ae\lib\site-packages\ray\tune\trainable\function_trainable.py", line 653, in _trainable_func
    output = fn()

  File "C:\Dev\ae\lib\site-packages\torch\nn\modules\linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (4x16 and 10x128)

However, when I run my training script on its own with the same hyperparameter settings, I do not experience any issues. Does anybody have any idea what might be going on, please?
I am attaching part of the script at the end of this post.

Thank you,
Mary

def prepare_data_and_model(config, opt_path):
    """Function to create dataloaders and model given the current ray trial settings.

    Args:
        config (dict): dictionary populated automatically by Ray Tune and corresponding to the
            hyperparameters selected for the trial from the search space
        opt_path (string): full path to the experiment settings json file

    Returns:
        model (FlavourSynth model): FlavourSynth model class object
        train_dataloader (torch DataLoader): training set dataloader
        valid_dataloader (torch DataLoader): validation set dataloader
    """
    # experiment config
    opt = d.load_json(opt_path)
    # tune latent size
    opt['model']['ae_net']['params']['latent_size'] = config['latent_size']
    opt['model']['disc']['input_size'] = config['latent_size']
    # tune classifier mid size
    opt['model']['ae_net']['params']['classifier_mid_size'] = config['classifier_mid_size']
    # tune discriminator mid size
    opt['model']['disc']['mid_size'] = config['disc_mid_size']
    # tune learning rate
    opt['train']['lr'] = config['lr']
    # tune batch size
    opt['train']['batch_size'] = config['batch_size']

    # Set manual seed for reproducibility
    torch.manual_seed(opt['train']['seed'])
    np.random.seed(opt['train']['seed'])

    # Training set
    trainset = database(opt['data'], is_train=True)
    train_dataloader = torch.utils.data.DataLoader(
        trainset,
        batch_size=config['batch_size'],
        shuffle=True,
        num_workers=0)

    # Validation set
    validset = database(opt['data'], is_train=False)
    valid_dataloader = torch.utils.data.DataLoader(
        validset,
        batch_size=1,
        shuffle=False,
        num_workers=0)

    model = Model(opt, is_train=True)
    return model, train_dataloader, valid_dataloader

def main():
    """Main function to perform hyperparameter tuning using Ray Tune."""
    search_space = {
        "lr": tune.grid_search([5e-5, 5e-4, 1e-4, 5e-3]),
        "batch_size": tune.grid_search([4, 8, 10, 12]),
        "latent_size": tune.grid_search([10, 16, 32, 64]),
        "classifier_mid_size": tune.grid_search([64, 128, 256]),
        "disc_mid_size": tune.grid_search([32, 64, 128, 256]),
    }

    tuner = tune.Tuner(
        tune.with_resources(train_model, {"gpu": 1}),
        tune_config=tune.TuneConfig(
            num_samples=1,  # number of times to sample from the hyperparameter space
            max_concurrent_trials=1,  # specify max concurrent runs
            scheduler=ASHAScheduler(metric="recon", mode="min"),
            chdir_to_trial_dir=False  # handling relative paths
        ),
        param_space=search_space,
    )
    results = tuner.fit()

    # Obtain a trial dataframe for every trial run by this `Tuner.fit()` call.
    dfs = {result.log_dir: result.metrics_dataframe for result in results}
    with open('tune_trial.pickle', 'wb') as file:
        pickle.dump(dfs, file)

Hey @Mary, I suspect there's an error in your trainable function, where the shape of the input tensor doesn't match the shape of your model's weights. From the error message, it seems that you are feeding an input tensor with 16 features into a linear layer of shape 10 x 128. It'd be helpful to check the following two things for debugging (there's a quick sketch after the list).

  • the size of the input tensor
  • the shape of the model
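For example, something along these lines right before the first forward pass would narrow it down (just a rough sketch; it assumes the dataloader yields plain tensors and that your model is a torch.nn.Module, so adapt the names to your code):

import torch

# Shape of the first batch coming out of the dataloader
first_batch = next(iter(train_dataloader))
print("input batch shape:", first_batch.shape)  # or first_batch[0].shape if it yields (x, y) pairs

# What every linear layer in the model expects
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        print(name, "in_features =", module.in_features, "out_features =", module.out_features)

The mismatch should show up as a layer whose in_features differs from the last dimension of the input batch.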

Hey, thanks for the response!
I have checked multiple times and could not see any mistake. As I mentioned in my previous post, when I use my training script independently from Ray, everything runs smoothly with the same settings.

Best,
Mary

Have you printed the shape of the inputs and compared it with your previous run? There could be some differences under the distributed setting even if you didn't change your code.

Also, could you share the code for the Trainable so that I can help check it?

Hey @yunxuanx thanks for the response.

Yes, I have printed the inputs; they are correct and updated based on the config.
However, I have noticed that my model is not updated based on the config, so the input size and the model size do not match.

Please find attached the trainable function.

def train_model(config):
    """Function to train a model specified by the current ray trial settings for a couple of epochs
    and report the results back to ray.

    Args:
        config (dict): dictionary populated automatically by Ray Tune and corresponding to the
            hyperparameters selected for the trial from the search space
    """
    model, train_dataloader, valid_dataloader = prepare_data_and_model(config, OPT_PATH)
    for i in range(N_EPOCHS):
        val_loss_accum = train(model, train_dataloader, valid_dataloader)
        session.report(val_loss_accum)
        # save the model to the trial directory every 5 epochs
        if i % 5 == 0:
            model.save_networks(i + 1, latest=False)

def train(model, train_dataloader, valid_dataloader):
    """Function to train the model for one epoch.

    Args:
        model (FlavourSynth model): FlavourSynth model class object
        train_dataloader (torch DataLoader): training set dataloader
        valid_dataloader (torch DataLoader): validation set dataloader

    Returns:
        dict: dictionary where keys are the validation loss names 'recon', 't_class' and 'a_disc'
            and values are the validation loss values
    """
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    model.AE.to(device)
    model.DISC_A.to(device)

    # reset iteration count and loss accum at the start of each epoch
    train_iter = 0
    train_loss_accum = {}

    model.train()  # enter training mode
    for data in train_dataloader:
        model.set_input(data)  # unpack data from dataloader and move to cuda
        model.optimize_parameters()  # backpropagation and stuff
        # losses: recon, t_class, a_disc, total
        losses = model.get_current_losses()
        # update dict with all loss info
        # keys: t_class, recon, a_disc, total
        train_loss_accum = updated_losses(train_loss_accum, losses, train_iter)
        train_iter = -1  # just a flag for updating losses in updated_losses func

    # Validation
    # Repeat all the above using the validation set
    val_loss_accum = {}
    val_iter = 0
    model.eval()
    for data in valid_dataloader:
        model.set_input(data)
        model.validate()
        losses = model.get_current_losses()
        val_loss_accum = updated_losses(val_loss_accum, losses, val_iter)
        val_iter = -1  # just a flag for updating losses in updated_losses func

    # Averaging loss items (updated_losses just does the sum)
    train_loss_accum = {k: v / train_iter for k, v in train_loss_accum.items()}
    val_loss_accum = {k: v / val_iter for k, v in val_loss_accum.items()}

    # return all losses except for total
    return {loss_name: loss_value for loss_name, loss_value in val_loss_accum.items() if loss_name != 'total'}

Hey @Mary, I am not 100% sure about the meaning of these configs. What are the input data size and the model size? From your code, my guess is that the input data shape stays unchanged (16) while you are tuning the model shape, which would cause an inconsistency.

“latent_size”: tune.grid_search([10, 16, 32, 64]), #-> model first layer size?
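For reference, the exact error in your traceback can be reproduced with a bare linear layer whose in_features doesn't match the input's last dimension. The numbers below just mirror the ones in the error message (a 4x16 input against a 10 -> 128 layer); they are not taken from your model:

import torch
import torch.nn as nn

layer = nn.Linear(in_features=10, out_features=128)  # weight is 128 x 10, so mat2 in the error is its 10 x 128 transpose
x = torch.randn(4, 16)                               # batch of 4 samples with 16 features each -> mat1 is 4 x 16
layer(x)  # RuntimeError: mat1 and mat2 shapes cannot be multiplied (4x16 and 10x128)

So somewhere a tensor with 16 features is reaching a layer that was built to accept 10.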

Hey thanks for the response!

config is the dict generated by Ray which contains the hyperparameter values for the current Ray trial. It is the same one that gets passed to the trainable function here.

The model consists of an autoencoder and a fully connected network. The latent size is the encoded size of the autoencoder, and this latent vector is the input to the fully connected network. The error seems to be happening in the fully connected network. It seems that the latent size value to be optimized gets updated by the config; however, the fully connected network's input size is not updated by this statement:

opt['model']['disc']['input_size'] = config['latent_size']

Hence the input and the model do not match, because the model has not been correctly updated by the config. However, I cannot see anything wrong in my code. As I said, with the same settings everything works fine outside Ray Tune.

Got it! I think we are close to the issue. Can you print the opt dict here?

def prepare_data_and_model():
    ...
    model = Model(opt, is_train=True) # <- here

Also check whether the key names in the opt dict correctly match the ones the Model's init function expects.
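Something like this just before the model is constructed would do (a quick sketch, assuming opt is a plain nested dict):

import json

# inside prepare_data_and_model, right before Model(opt, is_train=True):
print(json.dumps(opt['model'], indent=2, default=str))

Then compare the printed keys with the ones Model's init function actually reads from opt['model'].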

Hey! You spotted it, thank you so much!

instead of

opt['model']['disc']['input_size']

it should be

opt['model']['disc']['params']['input_size']
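For anyone who lands here with the same symptom, the latent-size lines in prepare_data_and_model now read:

# tune latent size
opt['model']['ae_net']['params']['latent_size'] = config['latent_size']
opt['model']['disc']['params']['input_size'] = config['latent_size']  # was opt['model']['disc']['input_size'], which the model never read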

I spent so much time trying to figure out what was wrong with my Ray settings, and it turns out it was this silly bug, as usual.

Thank you very much for your help :slight_smile:

Mary
