Centralized critic PPO with non-homogeneous agents

Hello! I’m looking for some guidance on how to approach implementing a centralized critic for two agents who have different action and observation spaces.

obs_space = Dict(agent-0:Box(-2147483648.0, 2147483648.0, (11,), float32), 
                 agent-1:Box(-2147483648.0, 2147483648.0, (13,), float32))

action_space = Dict(agent-0:Discrete(4), 
                    agent-1:Discrete(7))

I’m following the centralized_critic_2 example, but the example is built for an environment where agents have the same observation and action spaces/shapes.

My attempted approach defines a separate observer_space for each agent, containing own_obs, opponent_obs and opponent_action:

observer_space_0 = Dict(
    {
        "own_obs": obs_space['agent-0'],
        # These two fields are filled in by the CentralCriticObserver, and are
        # not used for inference, only for training.
        "opponent_obs": obs_space['agent-1'],
        "opponent_action": action_space['agent-1'],
    }
)

observer_space_1 = Dict(
    {
        "own_obs": obs_space['agent-1'],
        # These two fields are filled in by the CentralCriticObserver, and are
        # not used for inference, only for training.
        "opponent_obs": obs_space['agent-0'],
        "opponent_action": action_space['agent-0'],
    }
)
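
For reference, here is roughly what my central_critic_observer observation_fn looks like (a minimal adaptation of the example's observer for my agent IDs). The opponent_action entries are dummies that FillInActions overwrites later, and this sketch assumes both agents appear in agent_obs every step:

def central_critic_observer(agent_obs, **kw):
    """Rewrites each agent's obs to also include the opponent's data."""
    new_obs = {
        "agent-0": {
            "own_obs": agent_obs["agent-0"],
            "opponent_obs": agent_obs["agent-1"],
            "opponent_action": 0,  # dummy, filled in by FillInActions
        },
        "agent-1": {
            "own_obs": agent_obs["agent-1"],
            "opponent_obs": agent_obs["agent-0"],
            "opponent_action": 0,  # dummy, filled in by FillInActions
        },
    }
    return new_obs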

I then pass these observer_spaces into each agent’s policy in the config:

config = {
    "env": env_name,
    "batch_mode": "complete_episodes",
    "callbacks": FillInActions,
    "num_gpus": int(os.environ.get("RLLIB_NUM_GPUS", "0")),
    "num_workers": 1,
    "multiagent": {
        "policies": {
            # modified policies
            "pol1": (None, observer_space_0, action_space['agent-0'], {}),
            "pol2": (None, observer_space_1, action_space['agent-1'], {}),
        },
        "policy_mapping_fn": (lambda aid, **kwargs: "pol1" if aid == 'agent-0' else "pol2"),
        "observation_fn": central_critic_observer,
    },
    "model": {
        "custom_model": "custom_cc_model",
    },
    "framework": 'tf',
}

stop = {
    "training_iteration": 100,
    "timesteps_total": 10000,
    "episode_reward_mean": 10000,
}
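
Since the two agents have different action spaces, my FillInActions callback also has to pick a different one-hot encoder depending on whose batch is being post-processed. A rough sketch of what I mean, keeping the same structure as the example's callback (the exact columns to overwrite depend on where opponent_action lands in each policy's flattened Dict observation, so the slice at the end is only a placeholder):

import numpy as np
from gym.spaces import Discrete
from ray.rllib.agents.callbacks import DefaultCallbacks
from ray.rllib.models import ModelCatalog
from ray.rllib.policy.sample_batch import SampleBatch


class FillInActions(DefaultCallbacks):
    """Fills the opponent's actions into the other agent's training batch."""

    def on_postprocess_trajectory(
        self, worker, episode, agent_id, policy_id, policies,
        postprocessed_batch, original_batches, **kwargs
    ):
        # The opponent's id and action space differ per agent.
        other_id = "agent-1" if agent_id == "agent-0" else "agent-0"
        opp_action_space = Discrete(7) if other_id == "agent-1" else Discrete(4)
        action_encoder = ModelCatalog.get_preprocessor_for_space(opp_action_space)

        # One-hot encode the opponent's actions from its original batch.
        _, opponent_batch = original_batches[other_id]
        opponent_actions = np.array(
            [action_encoder.transform(a) for a in opponent_batch[SampleBatch.ACTIONS]]
        )

        # Overwrite the "opponent_action" slots of the flattened observation.
        # Placeholder slice: must match where opponent_action sits in this
        # policy's flattened observer_space.
        to_update = postprocessed_batch[SampleBatch.CUR_OBS]
        to_update[:, -opponent_actions.shape[1]:] = opponent_actions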

Then in the custom model, I can adjust the input shape to fit one of my agents’ observation spaces.

Should I incorporate a second action model here? The provided example is consistent with the MAPPO paper, where homogeneous agents share a single actor network, but I’m not sure how to approach this when each agent's observations are different.

from gym.spaces import Box
from ray.rllib.models.tf.fcnet import FullyConnectedNetwork
from ray.rllib.models.tf.tf_modelv2 import TFModelV2
from ray.rllib.utils.framework import try_import_tf

tf1, tf, tfv = try_import_tf()


class YetAnotherCentralizedCriticModel(TFModelV2):

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        super(YetAnotherCentralizedCriticModel, self).__init__(
            obs_space, action_space, num_outputs, model_config, name
        )

        self.action_model = FullyConnectedNetwork(
            Box(-2147483648.0, 2147483648.0, (13,)),  # agent-1's own_obs space (hardcoded)
            action_space,
            num_outputs,
            model_config,
            name + "_action",
        )

        self.value_model = FullyConnectedNetwork(
            obs_space, action_space, 1, model_config, name + "_vf"
        )

    def forward(self, input_dict, state, seq_lens):
        self._value_out, _ = self.value_model(
            {"obs": input_dict["obs_flat"]}, state, seq_lens
        )
        return self.action_model({"obs": input_dict["obs"]["own_obs"]}, state, seq_lens)

    def value_function(self):
        return tf.reshape(self._value_out, [-1])
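
One idea I've been considering, instead of hardcoding agent-1's shape or writing a second model class: since each policy builds its own model instance anyway, the same class could pull its own_obs space out of the original Dict observation space. A rough sketch (the class name is only illustrative; it assumes RLlib attaches original_space to the flattened obs_space it passes in, and it keeps forward() and value_function() exactly as above):

class SpaceAgnosticCentralizedCriticModel(TFModelV2):

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        super().__init__(obs_space, action_space, num_outputs, model_config, name)

        # This policy's own observation space, recovered from the Dict
        # observer_space rather than hardcoded for one agent.
        own_obs_space = obs_space.original_space.spaces["own_obs"]

        # Actor: conditioned only on the agent's own observation.
        self.action_model = FullyConnectedNetwork(
            own_obs_space, action_space, num_outputs, model_config, name + "_action"
        )

        # Centralized critic: conditioned on own obs + opponent obs + opponent action.
        self.value_model = FullyConnectedNetwork(
            obs_space, action_space, 1, model_config, name + "_vf"
        )

With something like this, pol1 and pol2 could share the single custom_model entry in the config, and each instance would size its actor branch from its own observer_space and num_outputs. Does that sound reasonable, or is a second action model the intended approach?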

To add one more complication, the two agents do not necessarily act at every time step, so the observation dictionary returned by my step function may contain observations for one agent, the other, or both. Any thoughts on whether this detail could be problematic?