Hello folks,
I'm using the TuneBOHB search algorithm, the HyperBandForBOHB scheduler, a ConcurrencyLimiter, and the RunConfig and search space below. The config affects the model structure (the sampled layer sizes determine the network's shape).
config = {
    "layer_1": tune.choice([32, 64, 128]),
    "layer_2": tune.choice([64, 128, 256]),
    "lr": tune.loguniform(1e-4, 1e-1),
    "batch_size": tune.choice([32, 64, 128]),
}
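For context, the sampled layer sizes feed straight into the module constructor, roughly like the Classifier from the Lightning tutorial (a simplified sketch, not my exact code):

import torch
import pytorch_lightning as pl

class Classifier(pl.LightningModule):
    # Sketch: layer widths come from the Tune config, so two trials with
    # different configs build networks with different weight shapes.
    def __init__(self, config):
        super().__init__()
        self.layer_1 = torch.nn.Linear(28 * 28, config["layer_1"])
        self.layer_2 = torch.nn.Linear(config["layer_1"], config["layer_2"])
        self.layer_3 = torch.nn.Linear(config["layer_2"], 10)
        self.lr = config["lr"]

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = torch.relu(self.layer_1(x))
        x = torch.relu(self.layer_2(x))
        return self.layer_3(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)

So a checkpoint written under one config cannot be loaded into a model built from a different one.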
run_config = RunConfig(
    name=exp_name,
    verbose=2,
    storage_path="./ray_results",
    log_to_file=True,
    checkpoint_config=CheckpointConfig(
        num_to_keep=2,
        checkpoint_score_attribute="ptl/val_accuracy",
        checkpoint_score_order="max",
    ),
)
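For completeness, this is roughly how everything is wired together (simplified; train_func, max_concurrent=4, and num_samples=10 stand in for my actual values):

from ray import tune
from ray.tune.schedulers import HyperBandForBOHB
from ray.tune.search import ConcurrencyLimiter
from ray.tune.search.bohb import TuneBOHB

# Searcher and scheduler share the metric/mode set in TuneConfig below.
algo = ConcurrencyLimiter(TuneBOHB(), max_concurrent=4)
scheduler = HyperBandForBOHB(time_attr="training_iteration", max_t=10)

tuner = tune.Tuner(
    train_func,  # my Lightning training function
    param_space=config,
    tune_config=tune.TuneConfig(
        metric="ptl/val_accuracy",
        mode="max",
        search_alg=algo,
        scheduler=scheduler,
        num_samples=10,
    ),
    run_config=run_config,
)
results = tuner.fit()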
I get the following error when a trial restarts from the PAUSED state:
Resuming training from an AIR checkpoint. ...
....
Restoring states from the checkpoint path at /var/folders/ws/vv4c_tgx1bn762pg20bds4c80000gn/T/checkpoint_tmp_a55dc2a3602246e7a9960b51c415fb7c/model
.........
RuntimeError: Error(s) in loading state_dict for Classifier:
size mismatch for layer_2.weight: copying a param with shape torch.Size([64, 128]) from checkpoint, the shape in current model is torch.Size([256, 128]).
size mismatch for layer_2.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([256]).
The shapes in the error tell the story: layer_1 (128) matches, but layer_2 differs (64 in the checkpoint vs 256 in the rebuilt model). So the trial appears to be reconstructed with a different sampled config than the one its checkpoint was saved under. My code follows the Ray Tune + PyTorch Lightning example, with only the scheduler and search algorithm swapped out.
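In case it matters, the checkpoint handling in my training function follows the example pattern, roughly like this (a simplified sketch; Trainer arguments, dataloaders, and callbacks are trimmed):

import os
import pytorch_lightning as pl
from ray import train

def train_func(config):
    # The model is always rebuilt from the trial's current config ...
    model = Classifier(config)
    trainer = pl.Trainer(max_epochs=10)  # trimmed; loggers/callbacks omitted

    checkpoint = train.get_checkpoint()
    if checkpoint:
        # ... and the saved weights are loaded on top of it. If the config
        # here differs from the one the checkpoint was written with,
        # load_state_dict fails with exactly this size mismatch.
        with checkpoint.as_directory() as ckpt_dir:
            trainer.fit(model, ckpt_path=os.path.join(ckpt_dir, "model"))
    else:
        trainer.fit(model)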
Any advice is welcome. Thank you!