How severe does this issue affect your experience of using Ray?
- High: It blocks me to complete my task.
Hello, my model is built from two neural networks: a Vnet, which supplies segmentation masks, and modelRegression, which counts the class instances in the segmentation map.
Both are kept in the Lightning model class:
self.net = net
self.modelRegression = UNetToRegresion(2,regression_channels)
However, when I try checkpointing, the load_state_dict function raises an error. What can I do?
trainer.lightning_module.load_state_dict(state_dict)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1604, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Model:
Missing key(s) in state_dict: "modelRegression.model.0.weight_fake_quant.min_vals", "modelRegression.model.0.weight_fake_quant.max_vals",
Strangely, the error occurs both when using TuneReportCallback and when using TuneReportCheckpointCallback.
Full error:
2022-09-22 15:25:20,952 ERROR tune.py:754 -- Trials did not complete: [mainTrain_722c5_00000]
2022-09-22 15:25:20,952 INFO tune.py:758 -- Total run time: 569.10 seconds (568.96 seconds for the tuning loop).
At least one trial failed.
The trial had an error: ray::ImplicitFunc.train() (pid=277, ip=10.164.0.3, repr=mainTrain)
File "/usr/local/lib/python3.8/dist-packages/ray/tune/trainable/trainable.py", line 347, in train
result = self.step()
File "/usr/local/lib/python3.8/dist-packages/ray/tune/trainable/function_trainable.py", line 417, in step
self._report_thread_runner_error(block=True)
File "/usr/local/lib/python3.8/dist-packages/ray/tune/trainable/function_trainable.py", line 589, in _report_thread_runner_error
raise e
File "/usr/local/lib/python3.8/dist-packages/ray/tune/trainable/function_trainable.py", line 289, in run
self._entrypoint()
File "/usr/local/lib/python3.8/dist-packages/ray/tune/trainable/function_trainable.py", line 362, in entrypoint
return self._trainable_func(
File "/usr/local/lib/python3.8/dist-packages/ray/tune/trainable/function_trainable.py", line 684, in _trainable_func
output = fn()
File "/usr/local/lib/python3.8/dist-packages/ray/tune/trainable/util.py", line 359, in inner
trainable(config, **fn_kwargs)
File "/home/sliceruser/data/piCaiCode/Three_chan_baseline.py", line 165, in mainTrain
ThreeChanNoExperiment.train_model(label_name, dummyLabelPath, df,percentSplit,cacheDir
File "/home/sliceruser/data/piCaiCode/ThreeChanNoExperiment.py", line 248, in train_model
trainer.fit(model=model, datamodule=data)
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
self._call_and_handle_interrupt(
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/ray_lightning/launchers/ray_launcher.py", line 64, in launch
self._recover_results_in_main_process(ray_output, trainer)
File "/usr/local/lib/python3.8/dist-packages/ray_lightning/launchers/ray_launcher.py", line 370, in _recover_results_in_main_process
trainer.lightning_module.load_state_dict(state_dict)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1604, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Model:
Missing key(s) in state_dict: "modelRegression.model.0.weight_fake_quant.min_vals", "modelRegression.model.0.weight_fake_quant.max_vals", "modelRegression.model.1.weight_fake_quant.min_vals", "modelRegression.model.1.weight_fake_quant.max_vals", "modelRegression.model.2.weight_fake_quant.min_vals", "modelRegression.model.2.weight_fake_quant.max_vals", "modelRegression.model.3.weight_fake_quant.min_vals", "modelRegression.model.3.weight_fake_quant.max_vals".
size mismatch for modelRegression.model.0.weight_fake_quant.min_val: copying a param with shape torch.Size([10]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for modelRegression.model.0.weight_fake_quant.max_val: copying a param with shape torch.Size([10]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for modelRegression.model.1.weight_fake_quant.min_val: copying a param with shape torch.Size([16]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for modelRegression.model.1.weight_fake_quant.max_val: copying a param with shape torch.Size([16]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for modelRegression.model.2.weight_fake_quant.min_val: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for modelRegression.model.2.weight_fake_quant.max_val: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for modelRegression.model.3.weight_fake_quant.min_val: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([0]).
size mismatch for modelRegression.model.3.weight_fake_quant.max_val: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([0]).
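
To narrow this down, it may help to compare the keys and shapes in the checkpoint against those of a freshly constructed model before calling load_state_dict. The following is only a minimal diagnostic sketch, not a fix: the helper names shapes_of and diff_state_dicts are hypothetical, and the example shapes are illustrative values modeled on the traceback above (the shape of the missing min_vals key is a placeholder, since the traceback does not report it).

```python
def shapes_of(state_dict):
    """Map each state_dict key to its tensor shape as a plain tuple."""
    return {key: tuple(value.shape) for key, value in state_dict.items()}


def diff_state_dicts(model_shapes, ckpt_shapes):
    """Compare two {key: shape} mappings.

    Returns (missing, unexpected, mismatched):
      missing    - keys the model expects but the checkpoint lacks
      unexpected - keys the checkpoint has but the model does not
      mismatched - (key, checkpoint_shape, model_shape) for differing shapes
    """
    missing = sorted(k for k in model_shapes if k not in ckpt_shapes)
    unexpected = sorted(k for k in ckpt_shapes if k not in model_shapes)
    mismatched = sorted(
        (k, tuple(ckpt_shapes[k]), tuple(model_shapes[k]))
        for k in model_shapes
        if k in ckpt_shapes and tuple(ckpt_shapes[k]) != tuple(model_shapes[k])
    )
    return missing, unexpected, mismatched


# Illustrative data only, modeled on the first layer in the traceback.
model_shapes = {
    # Placeholder shape: the traceback does not say what the model expects here.
    "modelRegression.model.0.weight_fake_quant.min_vals": (0,),
    "modelRegression.model.0.weight_fake_quant.min_val": (0,),
}
ckpt_shapes = {
    "modelRegression.model.0.weight_fake_quant.min_val": (10,),
}

missing, unexpected, mismatched = diff_state_dicts(model_shapes, ckpt_shapes)
print("missing from checkpoint:", missing)
print("unexpected in checkpoint:", unexpected)
print("shape mismatches (key, checkpoint, model):", mismatched)
```

With real objects, the call would look something like diff_state_dicts(shapes_of(model.state_dict()), shapes_of(state_dict)), where state_dict is whatever is about to be passed to load_state_dict. The torch.Size([0]) shapes on the model side suggest that the fake-quant buffers in the fresh model have not been populated yet while the ones in the checkpoint have, but that is a guess from the error message rather than something confirmed.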