Hello folks,
I'm using the TuneBOHB search algorithm, the HyperBandForBOHB scheduler, a ConcurrencyLimiter, and the RunConfig and search space below. The config affects the model structure (the sampled layer sizes determine the network's shape).
config = {
    "layer_1": tune.choice([32, 64, 128]),
    "layer_2": tune.choice([64, 128, 256]),
    "lr": tune.loguniform(1e-4, 1e-1),
    "batch_size": tune.choice([32, 64, 128]),
}
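For context, the sampled layer sizes feed straight into the module constructor, roughly like the Classifier from the Lightning tutorial (a simplified sketch, not my exact code):

import torch
import pytorch_lightning as pl

class Classifier(pl.LightningModule):
    # Sketch: layer widths come from the Tune config, so two trials with
    # different configs build networks with different weight shapes.
    def __init__(self, config):
        super().__init__()
        self.layer_1 = torch.nn.Linear(28 * 28, config["layer_1"])
        self.layer_2 = torch.nn.Linear(config["layer_1"], config["layer_2"])
        self.layer_3 = torch.nn.Linear(config["layer_2"], 10)
        self.lr = config["lr"]

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = torch.relu(self.layer_1(x))
        x = torch.relu(self.layer_2(x))
        return self.layer_3(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)

So a checkpoint written under one config cannot be loaded into a model built from a different one.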
run_config = RunConfig(
    name=exp_name,
    verbose=2,
    storage_path="./ray_results",
    log_to_file=True,
    checkpoint_config=CheckpointConfig(
        num_to_keep=2,
        checkpoint_score_attribute="ptl/val_accuracy",
        checkpoint_score_order="max",
    ),
)
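For completeness, this is roughly how everything is wired together (simplified; train_func, max_concurrent=4, and num_samples=10 stand in for my actual values):

from ray import tune
from ray.tune.schedulers import HyperBandForBOHB
from ray.tune.search import ConcurrencyLimiter
from ray.tune.search.bohb import TuneBOHB

# Searcher and scheduler share the metric/mode set in TuneConfig below.
algo = ConcurrencyLimiter(TuneBOHB(), max_concurrent=4)
scheduler = HyperBandForBOHB(time_attr="training_iteration", max_t=10)

tuner = tune.Tuner(
    train_func,  # my Lightning training function
    param_space=config,
    tune_config=tune.TuneConfig(
        metric="ptl/val_accuracy",
        mode="max",
        search_alg=algo,
        scheduler=scheduler,
        num_samples=10,
    ),
    run_config=run_config,
)
results = tuner.fit()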
I get the following error when a trial restarts from the PAUSED state:
Resuming training from an AIR checkpoint. ...
....
Restoring states from the checkpoint path at /var/folders/ws/vv4c_tgx1bn762pg20bds4c80000gn/T/checkpoint_tmp_a55dc2a3602246e7a9960b51c415fb7c/model
.........
RuntimeError: Error(s) in loading state_dict for Classifier:
size mismatch for layer_2.weight: copying a param with shape torch.Size([64, 128]) from checkpoint, the shape in current model is torch.Size([256, 128]).
size mismatch for layer_2.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([256]).
The shapes in the error tell the story: layer_1 (128) matches, but layer_2 differs (64 in the checkpoint vs 256 in the rebuilt model). So the trial appears to be reconstructed with a different sampled config than the one its checkpoint was saved under. My code follows the Ray Tune + PyTorch Lightning example, with only the scheduler and search algorithm swapped out.
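In case it matters, the checkpoint handling in my training function follows the example pattern, roughly like this (a simplified sketch; Trainer arguments, dataloaders, and callbacks are trimmed):

import os
import pytorch_lightning as pl
from ray import train

def train_func(config):
    # The model is always rebuilt from the trial's current config ...
    model = Classifier(config)
    trainer = pl.Trainer(max_epochs=10)  # trimmed; loggers/callbacks omitted

    checkpoint = train.get_checkpoint()
    if checkpoint:
        # ... and the saved weights are loaded on top of it. If the config
        # here differs from the one the checkpoint was written with,
        # load_state_dict fails with exactly this size mismatch.
        with checkpoint.as_directory() as ckpt_dir:
            trainer.fit(model, ckpt_path=os.path.join(ckpt_dir, "model"))
    else:
        trainer.fit(model)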
Any advice is welcome. Thank you!