HyperOptSearch hangs when points_to_evaluate is passed

Hi,

I have been trying to optimize a model with HyperOptSearch, but I cannot get the search running when I pass in existing good model configurations via points_to_evaluate. When I format the points_to_evaluate dictionary exactly like my space dictionary, the job just hangs (I launched it before the long weekend and it sat there the whole time without making any progress). If I instead collapse the dictionary into a single-level dictionary using an “outer_key/inner_key” key scheme, the script fails with “ValueError: HyperOpt encountered a GarbageCollected switch argument.”
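
To be concrete, the collapsed version looks roughly like this (an abbreviated illustration of the “outer_key/inner_key” pattern, not my full space):

from ray import tune

# Abbreviated sketch of the flattened, single-level space I tried:
flat_space = {
    'train_loop_config/flanking_nt': tune.randint(0, 50),
    'train_loop_config/lr': tune.loguniform(1e-9, 1e-1),
    'train_loop_config/weight_decay': tune.loguniform(1e-9, 1e-1),
}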

My tuning function looks like this:

# Imports (Ray 2.x paths)
from ray.train import ScalingConfig, RunConfig, CheckpointConfig
from ray.train.torch import TorchTrainer
from ray.tune import Tuner, TuneConfig
from ray.tune.schedulers import ASHAScheduler
from ray.tune.search.hyperopt import HyperOptSearch
from ray.air.integrations.wandb import WandbLoggerCallback

def tune_direct_model_asha(search_space, num_samples=10,
                           n_parallel=1, use_gpu=True,
                           resources_per_worker={"CPU": 10, "GPU": 1},
                           storage_path="./results", name=None,
                           points_to_evaluate=None):
    num_epochs = 50
    scheduler = ASHAScheduler(
        max_t=num_epochs,
        grace_period=2,
        reduction_factor=2,
        metric="val_H_v",
        mode="min",
    )
    
    scaling_config = ScalingConfig(
        num_workers=n_parallel, use_gpu=use_gpu, resources_per_worker=resources_per_worker
    )

    run_config = RunConfig(
        storage_path=storage_path, 
        name=name,
        checkpoint_config=CheckpointConfig(
            num_to_keep=2,
            checkpoint_score_attribute="val_H_v",
            checkpoint_score_order="min",
        ),
        callbacks=[WandbLoggerCallback(project=search_space['train_loop_config']['wandb_prefix'])]
    )

    # Define a TorchTrainer without hyper-parameters for Tuner
    ray_trainer = TorchTrainer(
        train_direct_model,
        scaling_config=scaling_config,
        run_config=run_config,
    )


    hyperopt_search = HyperOptSearch(
        search_space,
        metric="val_H_v",
        mode="min",
        points_to_evaluate=points_to_evaluate,
    )

    tuner = Tuner(
        ray_trainer,
        tune_config=TuneConfig(
            search_alg=hyperopt_search,
            num_samples=num_samples,
            scheduler=scheduler,
        ),
        # param_space={"train_loop_config":search_space},
    )
    return tuner.fit()

My space is defined as follows:

space = {'train_loop_config': {
    'Architecture': tune.choice([
        {'A1_MegaBlocks': 1,
         'A1_wsets': [tune.randint(3, 50) for i in range(1)],
         'A1_dsets': [1 for i in range(1)],
         'A1_batchsize': 32,
         },
        {'A2_MegaBlocks': 2,
         'A2_wsets': [tune.randint(3, 50) for i in range(2)],
         'A2_dsets': [tune.randint(1, 4) for i in range(2)],
         'A2_batchsize': 32,
         },
    ]),
    'flanking_nt': tune.randint(0, 50),
    'lr': tune.loguniform(1e-9, 1e-1),
    'k': tune.randint(10, 40),
    'embed_dim': tune.randint(10, 40),
    'weight_decay': tune.loguniform(1e-9, 1e-1),
    'wandb_prefix': name,
    'workers': 8,
    'verbose': False,
    'debug': False,
}}

and my previous model configuration is defined as follows:

old_model = [
    {'train_loop_config': {
        'Architecture': {
            'A2_MegaBlocks': 2,
            'A2_wsets': [11, 11],
            'A2_dsets': [1, 4],
            'A2_batch_size': 32,
        },
        'flanking_nt': 30,
        'lr': 0.001,
        'k': 32,
        'embed_dim': 32,
        'weight_decay': 1e-2,
        'wandb_prefix': 'Fast_test_overfitting_14',
        'workers': 8,
        'verbose': False,
        'debug': False,
    }}  # Hung
]
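
For reference, I launch the sweep roughly like this (the num_samples value here is just illustrative):

results = tune_direct_model_asha(
    space,
    num_samples=20,  # illustrative value
    points_to_evaluate=old_model,
)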

I also tried defining the space with hyperopt's own sampling functions (hp.choice, hp.loguniform, etc.) instead of the tune.* samplers and ran into the same issues. An abbreviated sketch of what that looked like:
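
from hyperopt import hp
import numpy as np

# Abbreviated sketch of the hyperopt-native space I tried; the nested
# Architecture choice followed the same structure as the tune space above.
hp_space = {'train_loop_config': {
    'flanking_nt': hp.randint('flanking_nt', 50),
    'lr': hp.loguniform('lr', np.log(1e-9), np.log(1e-1)),
    'weight_decay': hp.loguniform('weight_decay', np.log(1e-9), np.log(1e-1)),
}}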

When I run sweeps without passing points_to_evaluate, everything runs fine.

Has anyone seen this issue before, or have a suggestion for fixing it?

Thanks!

EDIT: I moved the outer-level dictionary nesting of search_space from the tuning function into its original definition, to improve clarity.