Hi,
I have been trying to optimize a model with HyperOptSearch, but I cannot get the search running when I pass in existing good model configurations. When I format the points_to_evaluate dictionary exactly like my space dictionary, the job just hangs (I launched it before the long weekend and it sat the entire time without making any progress). If I instead collapse the dictionary into a single-level dictionary using an "outer_key/inner_key" naming strategy, the script fails with "ValueError: HyperOpt encountered a GarbageCollected switch argument."
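For reference, the flattened version of the old configuration looked roughly like this (a sketch; the keys mirror the nested space and point shown further down):

old_model_flat = [{
    'train_loop_config/Architecture/A2_MegaBlocks': 2,
    'train_loop_config/Architecture/A2_wsets': [11, 11],
    'train_loop_config/Architecture/A2_dsets': [1, 4],
    'train_loop_config/Architecture/A2_batch_size': 32,
    'train_loop_config/flanking_nt': 30,
    'train_loop_config/lr': 0.001,
    'train_loop_config/k': 32,
    'train_loop_config/embed_dim': 32,
    'train_loop_config/weight_decay': 1e-2,
    'train_loop_config/wandb_prefix': 'Fast_test_overfitting_14',
    'train_loop_config/workers': 8,
    'train_loop_config/verbose': False,
    'train_loop_config/debug': False,
}]  # this variant raises the GarbageCollected switch argument ValueError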
My tuning function looks like this:
from ray import tune
from ray.train import CheckpointConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer
from ray.tune import TuneConfig, Tuner
from ray.tune.schedulers import ASHAScheduler
from ray.tune.search.hyperopt import HyperOptSearch
from ray.air.integrations.wandb import WandbLoggerCallback


def tune_direct_model_asha(search_space, num_samples=10,
                           n_parallel=1, use_gpu=True,
                           resources_per_worker={"CPU": 10, "GPU": 1},
                           storage_path="./results", name=None,
                           points_to_evaluate=None):
    num_epochs = 50
    scheduler = ASHAScheduler(
        max_t=num_epochs,
        grace_period=2,
        reduction_factor=2,
        metric="val_H_v",
        mode="min",
    )
    scaling_config = ScalingConfig(
        num_workers=n_parallel, use_gpu=use_gpu,
        resources_per_worker=resources_per_worker,
    )
    run_config = RunConfig(
        storage_path=storage_path,
        name=name,
        checkpoint_config=CheckpointConfig(
            num_to_keep=2,
            checkpoint_score_attribute="val_H_v",
            checkpoint_score_order="min",
        ),
        callbacks=[WandbLoggerCallback(project=search_space['wandb_prefix'])],
    )
    # Define a TorchTrainer without hyper-parameters for Tuner
    ray_trainer = TorchTrainer(
        train_direct_model,
        scaling_config=scaling_config,
        run_config=run_config,
    )
    hyperopt_search = HyperOptSearch(
        search_space,
        metric="val_H_v",
        mode="min",
        points_to_evaluate=points_to_evaluate,
    )
    tuner = Tuner(
        ray_trainer,
        tune_config=TuneConfig(
            search_alg=hyperopt_search,
            num_samples=num_samples,
            scheduler=scheduler,
        ),
        # param_space={"train_loop_config": search_space},
    )
    return tuner.fit()
My space is defined as follows:
space = {'train_loop_config': {
    'Architecture': tune.choice([
        {'A1_MegaBlocks': 1,
         'A1_wsets': [tune.randint(3, 50) for i in range(1)],
         'A1_dsets': [1 for i in range(1)],
         'A1_batchsize': 32,
         },
        {'A2_MegaBlocks': 2,
         'A2_wsets': [tune.randint(3, 50) for i in range(2)],
         'A2_dsets': [tune.randint(1, 4) for i in range(2)],
         'A2_batchsize': 32,
         },
    ]),
    'flanking_nt': tune.randint(0, 50),
    'lr': tune.loguniform(1e-9, 1e-1),
    'k': tune.randint(10, 40),
    'embed_dim': tune.randint(10, 40),
    'weight_decay': tune.loguniform(1e-9, 1e-1),
    'wandb_prefix': name,
    'workers': 8,
    'verbose': False,
    'debug': False,
}}
My previous model version (the configuration I pass as points_to_evaluate) is defined as follows:
old_model = [
    {'train_loop_config': {
        'Architecture': {
            'A2_MegaBlocks': 2,
            'A2_wsets': [11, 11],
            'A2_dsets': [1, 4],
            'A2_batch_size': 32,
        },
        'flanking_nt': 30,
        'lr': 0.001,
        'k': 32,
        'embed_dim': 32,
        'weight_decay': 1e-2,
        'wandb_prefix': 'Fast_test_overfitting_14',
        'workers': 8,
        'verbose': False,
        'debug': False,
    }}  # Hung
]
I also tried defining the space with hyperopt's native sampling functions instead of the tune.* API and hit the same issues.
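That attempt looked roughly like this (a sketch, abbreviated; the Architecture choice and the fixed keys followed the same pattern as the tune.* space above):

import numpy as np
from hyperopt import hp

space_hp = {'train_loop_config': {
    'flanking_nt': hp.randint('flanking_nt', 50),  # int in [0, 50)
    'lr': hp.loguniform('lr', np.log(1e-9), np.log(1e-1)),
    'k': hp.uniformint('k', 10, 40),
    'embed_dim': hp.uniformint('embed_dim', 10, 40),
    'weight_decay': hp.loguniform('weight_decay', np.log(1e-9), np.log(1e-1)),
    # ... 'Architecture' via hp.choice(...), plus the fixed keys as above
}}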
When I run sweeps without passing points_to_evaluate, everything runs just fine.
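For completeness, the calls look roughly like this (a sketch; space, old_model, and name are as defined above):

# runs fine
results = tune_direct_model_asha(space, name=name)

# hangs with the nested point, or raises the ValueError with the flattened point
results = tune_direct_model_asha(space, name=name, points_to_evaluate=old_model)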
Has anyone seen this issue before, or have a suggestion for fixing it?
Thanks!
EDIT: I moved the outer-level 'train_loop_config' nesting of search_space from the tuning function into the original space definition to improve clarity.