Hyperparameter search with Tune for multi-agent environments

I’m trying to do a hyperparameter search with Tune for a custom multi-agent environment. All agents should share the same network architecture, but they are trained as individual policies. I want to iterate over LSTM hidden cell sizes, but when I specify a grid via tune.grid_search([x, x, x]) it is applied separately to each agent. How can I enforce that all agents share the same architecture?

Example:

exper_params = {"lstm_cell_size": tune.grid_search([32, 64])}


policy_map = {
    "policy_0": (
        None, obs_space_high, act_space_high,
        {"model": {"fcnet_hiddens": exper_params["policy_fcnet_hiddens"], "fcnet_activation": "tanh",
                   "use_lstm": True,
                   "lstm_cell_size": exper_params["lstm_cell_size"],
                   "max_seq_len": exper_params["policy_max_seq_length"]}}),
    "policy_1": (
        None, obs_space_high, act_space_high,
        {"model": {"fcnet_hiddens": exper_params["policy_fcnet_hiddens"], "fcnet_activation": "tanh",
                   "use_lstm": True,
                   "lstm_cell_size": exper_params["lstm_cell_size"],
                   "max_seq_len": exper_params["policy_max_seq_length"]}}),
    "policy_2": (
        None, obs_space_high, act_space_high,
        {"model": {"fcnet_hiddens": exper_params["policy_fcnet_hiddens"], "fcnet_activation": "tanh",
                   "use_lstm": True,
                   "lstm_cell_size": exper_params["lstm_cell_size"],
                   "max_seq_len": exper_params["policy_max_seq_length"]}}),
}

Hi @henry_lei, I’m not exactly sure what you want to achieve.

tune.grid_search defines the values to search over. Each of those values is sampled exactly num_samples times.

Are you talking about keeping policy_fcnet_hiddens and policy_max_seq_length constant? If so, you can just pass a constant in the search space (exper_params). If you also want to search over these, but want to keep the sampled values constant across grid-search variants, you can use the constant_grid_search parameter of the BasicVariantGenerator.
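
For illustration, a minimal sketch of how that could be wired up (assuming Ray 2.x import paths; config and param_space are the ones from the Tuner setup you post further down):

from ray import tune
from ray.tune.search.basic_variant import BasicVariantGenerator

# Re-use the same randomly sampled values for every grid-search variant.
tuner = tune.Tuner(
    "R2D2",
    tune_config=tune.TuneConfig(
        search_alg=BasicVariantGenerator(constant_grid_search=True),
    ),
    param_space=config.to_dict(),
)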

I want to search over a few values of the LSTM hidden cell size, but I want every policy to share the value produced by the generator so that the network architecture is homogeneous across agents. As the config is currently written, each policy seems to receive its own copy of the grid-search generator rather than the single value it produces. For example, if I specify a grid_search generator in exper_params as grid_search([1, 2, 3]), it produces trials like:

policy_1: 1, policy_2: 1, policy_3: 1
policy_1: 1, policy_2: 1, policy_3: 2
policy_1: 1, policy_2: 1, policy_3: 3
policy_1: 1, policy_2: 2, policy_3: 1, etc.

for a total of 27 trials (3³ combinations),

whereas I would want:

policy_1: 1, policy_2: 1, policy_3: 1
policy_1: 2, policy_2: 2, policy_3: 2
policy_1: 3, policy_2: 3, policy_3: 3

for a total of 3 trials.

Can you post your full config and the code where you instantiate the Tuner and call tuner.fit() (or even more of the code)?

If you want to use the same LSTM size across policies, you should be able to just instantiate them all with the same config parameter within the trainable. Since this is presumably related to RLlib, more context and code would be helpful.
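
As a minimal sketch of that idea (reusing obs_space_high / act_space_high from your snippet; the hard-coded 64 stands in for whatever single value gets sampled for the trial):

# Define the model config once so every policy is guaranteed the same architecture.
shared_model = {
    "fcnet_hiddens": [64, 64],
    "fcnet_activation": "tanh",
    "use_lstm": True,
    "lstm_cell_size": 64,  # the single sampled value, shared by all policies
    "max_seq_len": 20,
}

policy_map = {
    f"policy_{i}": (None, obs_space_high, act_space_high, {"model": dict(shared_model)})
    for i in range(3)
}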

# Policy Selection Method
def select_policy(agent_id):
    if agent_id == "deputy_0":
        policyname = "policy_0"
    elif agent_id == "deputy_1":
        policyname = "policy_1"
    elif agent_id == "deputy_2":
        policyname = "policy_2"
    return policyname



# Training configs
exper_params = {"bufferCap": 100000, "burn_in": 20, "learn_rate": tune.grid_search([0.00005, 0.0001, 0.0005, 0.001]), "batch_size": tune.grid_search([128, 256, 512, 1028]),
        "discount_rate": tune.grid_search([0.75, 0.85, 0.95, 0.99]), "num_workers": 15, "policy_fcnet_hiddens": [64, 64], "lstm_cell_size": tune.grid_search([32, 64, 128, 256]),
        "policy_max_seq_length": 20, "timesteps_trained": 75000}

# Chief object and viewpoint params
chief_params = {"Point cloud": infoEnv.chief_object.ptCldName, "Number of points": infoEnv.chief_object.numPoints,
        "Diam of bounding box": infoEnv.chief_object.diam, "Projection diam": infoEnv.chief_object.sphereRadius,
        "Viewpoint dist from origin": infoEnv.chief_object.viewScale}

# Environment params
env_params = {"Rotation Mode": infoEnv.env_config["RotationMode"], "Number of viewpoints": infoEnv.num_inspection_points, "Agent Field of View": infoEnv.FOV,
        "Angular velocity scaling": infoEnv.SF, "POI reward": infoEnv.POI_reward, "Fuel penalty": infoEnv.fuel_penalty, "Inspection threshold": infoEnv.info_thresh,
        "Reward tranlsation": infoEnv.reward_translation}


# Configs
config = R2D2Config()
config.environment(env="HLInfoInspEnv", env_config=env_config)
# dict.update() mutates replay_buffer_config in place and returns None, so no assignment is needed
config.replay_buffer_config.update({"capacity": exper_params["bufferCap"],
                                    "replay_burn_in": exper_params["burn_in"]})
config.training(lr=exper_params["learn_rate"],
                train_batch_size=exper_params["batch_size"],
                gamma=exper_params["discount_rate"])
config.rollouts(num_rollout_workers=exper_params["num_workers"])
policy_map = {
    "policy_0": (
        None, obs_space_high, act_space_high,
        {"model": {"fcnet_hiddens": exper_params["policy_fcnet_hiddens"], "fcnet_activation": "tanh",
                   "use_lstm": True,
                   "lstm_cell_size": exper_params["lstm_cell_size"],
                   "max_seq_len": exper_params["policy_max_seq_length"]}}),
    "policy_1": (
        None, obs_space_high, act_space_high,
        {"model": {"fcnet_hiddens": exper_params["policy_fcnet_hiddens"], "fcnet_activation": "tanh",
                   "use_lstm": True,
                   "lstm_cell_size": exper_params["lstm_cell_size"],
                   "max_seq_len": exper_params["policy_max_seq_length"]}}),
    "policy_2": (
        None, obs_space_high, act_space_high,
        {"model": {"fcnet_hiddens": exper_params["policy_fcnet_hiddens"], "fcnet_activation": "tanh",
                   "use_lstm": True,
                   "lstm_cell_size": exper_params["lstm_cell_size"],
                   "max_seq_len": exper_params["policy_max_seq_length"]}}),
}
config.multi_agent(policies=policy_map, policy_mapping_fn=select_policy)

# Run configs
stop_dict = {'timesteps_total': exper_params["timesteps_trained"]}


# Train - saves experiment to an output folder.
tuner = tune.Tuner(
    "R2D2",
    run_config=air.RunConfig(
        name="experiment_output",
        stop=stop_dict,
        local_dir=output_dir,
        sync_config=sync_config,
        checkpoint_config=air.CheckpointConfig(
            checkpoint_score_attribute="episode_reward_mean",
            checkpoint_frequency=1,
            num_to_keep=2,
        ),
    ),
    param_space=config.to_dict())
results = tuner.fit()

Thanks! This is helpful.

The way you can do this is with tune.sample_from, which can access existing config keys.

For example, it could look like this:

policy_map = {
    "policy_0": (
        None, obs_space_high, act_space_high,
        {"model": {"fcnet_hiddens": exper_params["policy_fcnet_hiddens"], "fcnet_activation": "tanh",
                   "use_lstm": True,
                   "lstm_cell_size": tune.grid_search([32, 64, 128, 256]),
                   "max_seq_len": exper_params["policy_max_seq_length"]}}),
    "policy_1": (
        None, obs_space_high, act_space_high,
        {"model": {"fcnet_hiddens": exper_params["policy_fcnet_hiddens"], "fcnet_activation": "tanh",
                   "use_lstm": True,
                   "lstm_cell_size": tune.sample_from(lambda config: config["policies"]["policy_0"][3]["model"]["lstm_cell_size"]),
                   "max_seq_len": exper_params["policy_max_seq_length"]}}),
    "policy_2": (
        None, obs_space_high, act_space_high,
        {"model": {"fcnet_hiddens": exper_params["policy_fcnet_hiddens"], "fcnet_activation": "tanh",
                   "use_lstm": True,
                   "lstm_cell_size": tune.sample_from(lambda config: config["policies"]["policy_0"][3]["model"]["lstm_cell_size"]),
                   "max_seq_len": exper_params["policy_max_seq_length"]}}),
}

Notice how we use grid_search only once (for policy_0) and otherwise refer to the existing parameter with tune.sample_from(lambda config: config["policies"]["policy_0"][3]["model"]["lstm_cell_size"]). That way each trial resolves lstm_cell_size once and the other policies copy that resolved value, so the LSTM size contributes only one dimension to the grid instead of a cross product across policies.
