Centralized critic PPO with non-homogeneous agents

Hello! I’m looking for some guidance on how to approach implementing a centralized critic for two agents who have different action and observation spaces.

obs_space = Dict(agent-0:Box(-2147483648.0, 2147483648.0, (11,), float32), 
                 agent-1:Box(-2147483648.0, 2147483648.0, (13,), float32))

action_space = Dict(agent-0:Discrete(4), 
                    agent-1:Discrete(7))

I’m following the centralized_critic_2 example, but the example is built for an environment where agents have the same observation and action spaces/shapes.

My attempted approach defines a separate observer_space for each agent, containing own_obs, opponent_obs and opponent_action:

observer_space_0 = Dict(
    {
        "own_obs": obs_space['agent-0'],
        # These two fields are filled in by the CentralCriticObserver, and are
        # not used for inference, only for training.
        "opponent_obs": obs_space['agent-1'],
        "opponent_action": action_space['agent-1'],
    }
)

observer_space_1 = Dict(
    {
        "own_obs": obs_space['agent-1'],
        # These two fields are filled in by the CentralCriticObserver, and are
        # not used for inference, only for training.
        "opponent_obs": obs_space['agent-0'],
        "opponent_action": action_space['agent-0'],
    }
)
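
For reference, here is roughly what my central_critic_observer observation_fn looks like (a minimal adaptation of the example's observer for my agent IDs). The opponent_action entries are dummies that FillInActions overwrites later, and this sketch assumes both agents appear in agent_obs every step:

def central_critic_observer(agent_obs, **kw):
    """Rewrites each agent's obs to also include the opponent's data."""
    new_obs = {
        "agent-0": {
            "own_obs": agent_obs["agent-0"],
            "opponent_obs": agent_obs["agent-1"],
            "opponent_action": 0,  # dummy, filled in by FillInActions
        },
        "agent-1": {
            "own_obs": agent_obs["agent-1"],
            "opponent_obs": agent_obs["agent-0"],
            "opponent_action": 0,  # dummy, filled in by FillInActions
        },
    }
    return new_obs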

I then pass these observer_spaces into each agent’s policy in the config:

config = {
    "env": env_name,
    "batch_mode": "complete_episodes",
    "callbacks": FillInActions,
    "num_gpus": int(os.environ.get("RLLIB_NUM_GPUS", "0")),
    "num_workers": 1,
    "multiagent": {
        "policies": {
            # modified policies
            "pol1": (None, observer_space_0, action_space['agent-0'], {}),
            "pol2": (None, observer_space_1, action_space['agent-1'], {}),
        },
        "policy_mapping_fn": (lambda aid, **kwargs: "pol1" if aid == 'agent-0' else "pol2"),
        "observation_fn": central_critic_observer,
    },
    "model": {
        "custom_model": "custom_cc_model",
    },
    "framework": 'tf',
}

stop = {
    "training_iteration": 100,
    "timesteps_total": 10000,
    "episode_reward_mean": 10000,
}
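
Since the two agents have different action spaces, my FillInActions callback also has to pick a different one-hot encoder depending on whose batch is being post-processed. A rough sketch of what I mean, keeping the same structure as the example's callback (the exact columns to overwrite depend on where opponent_action lands in each policy's flattened Dict observation, so the slice at the end is only a placeholder):

import numpy as np
from gym.spaces import Discrete
from ray.rllib.agents.callbacks import DefaultCallbacks
from ray.rllib.models import ModelCatalog
from ray.rllib.policy.sample_batch import SampleBatch


class FillInActions(DefaultCallbacks):
    """Fills the opponent's actions into the other agent's training batch."""

    def on_postprocess_trajectory(
        self, worker, episode, agent_id, policy_id, policies,
        postprocessed_batch, original_batches, **kwargs
    ):
        # The opponent's id and action space differ per agent.
        other_id = "agent-1" if agent_id == "agent-0" else "agent-0"
        opp_action_space = Discrete(7) if other_id == "agent-1" else Discrete(4)
        action_encoder = ModelCatalog.get_preprocessor_for_space(opp_action_space)

        # One-hot encode the opponent's actions from its original batch.
        _, opponent_batch = original_batches[other_id]
        opponent_actions = np.array(
            [action_encoder.transform(a) for a in opponent_batch[SampleBatch.ACTIONS]]
        )

        # Overwrite the "opponent_action" slots of the flattened observation.
        # Placeholder slice: must match where opponent_action sits in this
        # policy's flattened observer_space.
        to_update = postprocessed_batch[SampleBatch.CUR_OBS]
        to_update[:, -opponent_actions.shape[1]:] = opponent_actions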

Then in the custom model, I can adjust the input shape to fit one of my agents’ observation spaces.

Should I incorporate a second action model here? The provided example is consistent with the MAPPO paper, where homogeneous agents share a single actor network, but I’m not sure how to approach this when each agent's observations are different.

from gym.spaces import Box
from ray.rllib.models.tf.fcnet import FullyConnectedNetwork
from ray.rllib.models.tf.tf_modelv2 import TFModelV2
from ray.rllib.utils.framework import try_import_tf

tf1, tf, tfv = try_import_tf()


class YetAnotherCentralizedCriticModel(TFModelV2):

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        super(YetAnotherCentralizedCriticModel, self).__init__(
            obs_space, action_space, num_outputs, model_config, name
        )

        self.action_model = FullyConnectedNetwork(
            Box(-2147483648.0, 2147483648.0, (13,)),  # agent-1's own_obs space (hardcoded)
            action_space,
            num_outputs,
            model_config,
            name + "_action",
        )

        self.value_model = FullyConnectedNetwork(
            obs_space, action_space, 1, model_config, name + "_vf"
        )

    def forward(self, input_dict, state, seq_lens):
        self._value_out, _ = self.value_model(
            {"obs": input_dict["obs_flat"]}, state, seq_lens
        )
        return self.action_model({"obs": input_dict["obs"]["own_obs"]}, state, seq_lens)

    def value_function(self):
        return tf.reshape(self._value_out, [-1])
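
One idea I've been considering, instead of hardcoding agent-1's shape or writing a second model class: since each policy builds its own model instance anyway, the same class could pull its own_obs space out of the original Dict observation space. A rough sketch (the class name is only illustrative; it assumes RLlib attaches original_space to the flattened obs_space it passes in, and it keeps forward() and value_function() exactly as above):

class SpaceAgnosticCentralizedCriticModel(TFModelV2):

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        super().__init__(obs_space, action_space, num_outputs, model_config, name)

        # This policy's own observation space, recovered from the Dict
        # observer_space rather than hardcoded for one agent.
        own_obs_space = obs_space.original_space.spaces["own_obs"]

        # Actor: conditioned only on the agent's own observation.
        self.action_model = FullyConnectedNetwork(
            own_obs_space, action_space, num_outputs, model_config, name + "_action"
        )

        # Centralized critic: conditioned on own obs + opponent obs + opponent action.
        self.value_model = FullyConnectedNetwork(
            obs_space, action_space, 1, model_config, name + "_vf"
        )

With something like this, pol1 and pol2 could share the single custom_model entry in the config, and each instance would size its actor branch from its own observer_space and num_outputs. Does that sound reasonable, or is a second action model the intended approach?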

To add one more complication, the two agents do not necessarily act at every time step, so the observation dictionary returned by my step function may contain observations for one agent, the other, or both. Any thoughts on whether this detail could be problematic?