Global optima with centralized critic (basic understanding)

hi guys, I just spent over two days trying to find out how a centralized critic leads MARL agents to a global optimum. So I thought: why not ask here and maybe help other MARL beginners.

In my test environment there are two centralized-critic PPO agents. Each agent's goal is to go to its "house". The observation of each agent is its x and y coordinates as well as the x and y coordinates of its goal. Agent 1 receives -1 reward and agent 2 receives -2 reward each time step. Once an agent stands inside its house, it receives 0 reward each time step. The observation of the shared critic is the agent's obs as well as the opponent agent's obs and action (just like in the ray/rllib/examples/ example).
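To make the setup concrete, here is a minimal sketch (not the actual environment code) of how the observations described above could be laid out; all function names and the action encoding are assumptions for illustration:

```python
def agent_obs(agent_xy, goal_xy):
    # Each agent sees its own (x, y) and its goal's (x, y): a 4-dim obs.
    return list(agent_xy) + list(goal_xy)

def central_critic_input(own_obs, opp_obs, opp_action, num_actions=2):
    # The shared critic additionally sees the opponent's obs and a
    # one-hot encoding of the opponent's last action.
    one_hot = [1.0 if a == opp_action else 0.0 for a in range(num_actions)]
    return own_obs + opp_obs + one_hot

obs1 = agent_obs((0, 0), (5, 5))
obs2 = agent_obs((1, 0), (5, 0))
critic_in = central_critic_input(obs1, obs2, opp_action=1)
print(len(obs1), len(critic_in))  # 4 10
```

So each actor conditions on 4 numbers, while the shared critic conditions on 10.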


To reach the global optimum, agent 1 has to wait until agent 2 reaches its house and move through the corridor shortly after it. From reading various papers I think that centralized-critic PPO is capable of learning how to reach the global optimum, and some of my tests are showing that it can, but I really don't get how the centralized critic is doing this. Because I thought in the end it's just a critic with more observations.

I'm also thankful for any valuable resources.

I’m also new to the multi-agent setting, but I think your last sentence is exactly the point. If I understand correctly, the centralized critic knows the state of agents 1 and 2, whereas the opponent's position is not included in the normal observation. Thus, estimating the return based on the agent observation alone will not be possible, because e.g. in the state you have drawn, the value of the action "move down" for agent 1 strongly depends on whether agent 2 is inside the tunnel or not (assuming that they can block the way). Note also that the agent does not have a notion of time, so you cannot expect it to learn to "wait in the beginning"; instead it would need to wait based on its observation. I hope this illustrates that there is a lack of crucial information.

Without the positions of both agents, estimating the value is significantly harder, if not impossible. With both positions available, it is almost trivial. As a consequence, training is of course faster and converges better.
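To illustrate the aliasing problem: from agent 1's own obs alone, two situations can look identical but have different values, so no function of that obs can predict both correctly. The numbers below are made up:

```python
own_obs = (2, 3, 5, 5)           # agent 1's position and goal, same in both cases
value_if_tunnel_free = -4.0      # opponent already in its house
value_if_tunnel_blocked = -9.0   # opponent inside the tunnel

# A decentralized critic V(own_obs) must output a single number for
# this obs, so at best it learns the average of the two cases.
best_single_estimate = (value_if_tunnel_free + value_if_tunnel_blocked) / 2
print(best_single_estimate)  # -6.5
```

A centralized critic that also sees the opponent's position can output -4.0 or -9.0 as appropriate instead of this blurred average.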

That’s my view on this issue, let me know what you think!



Hi @abrandenb,
thx for your answer. I know that a centralized critic can predict appropriate values much better because of its larger observation.

My problem is that many papers I have read state that a centralized critic leads to a global optimum, and I don't understand how the centralized critic does this. Because if a centralized critic acts just like a critic with more information, each agent still aims to maximize its own reward, and not the global optimum.

This would mean that it was just a coincidence that my runs led to the global optimum (agent 2 is the first to enter its house), and it is equally likely that agent 1 is the first to enter its house, which would not be the global optimum, since agent 1 receives -1 reward and agent 2 receives -2 reward each time step they aren't in their houses.
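The ordering argument can be checked with toy arithmetic, assuming a corridor only one agent can occupy at a time, an assumed 3 steps per traversal, and the waiting agent accruing its per-step penalty outside:

```python
STEPS = 3  # assumed steps to cross the corridor

def total_return(first_rate, second_rate):
    # The first agent crosses immediately (pays its rate for STEPS steps);
    # the second waits STEPS steps and then crosses, so it pays its rate
    # for 2 * STEPS steps in total.
    return first_rate * STEPS + second_rate * 2 * STEPS

agent2_first = total_return(-2, -1)  # agent 2 (-2/step) crosses first
agent1_first = total_return(-1, -2)  # agent 1 (-1/step) crosses first
print(agent2_first, agent1_first)  # -12 -15
```

Letting the more expensive agent 2 go first yields -12 instead of -15, so agent-2-first is indeed the better joint outcome under these assumptions.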

Ah, I see, I’ve not understood the core of the question, sorry!

I have the feeling the centralized critic can make it easier to reach the global optimum, but does not give a guarantee. I’ve not read the paper, but Fig. 7 (Appendix B) in this paper by Lyu et al. hints that the global optimum is not reached even in simple settings.

Maybe reading the paper gives you some insight, when you find something out, let me know! Like I said before, I’m also new to the multi-agent world, so I’m eager to learn :slight_smile:


Hi @abrandenb thx for the paper, it really sounds interesting.

Do you know how to “train / implement” a centralized critic? I think this would answer my question.

Because if a centralized critic is really just a critic with more observations, and its loss is calculated the same way as it would be in a normal actor-critic architecture, then there is no doubt that a centralized critic just maximizes its actor's reward and not the global optimum.

But somehow I have in mind that the loss of a centralized critic is calculated with respect to the sum of all agents' rewards, and that it therefore knows how to reach a global optimum.

Somehow I doubt that this is the case; at least I would not implement it like this. Currently I'm actually working on a competitive environment where, for the two agents, I define the rewards r2 = -r1. If a centralized critic took the sum of both, the value estimate would be 0. Therefore it would only make sense to train cooperative agents in a centralized way. That being said, I'm training decentralized, so I'm not sure whether this is the case or not.
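A quick check of that point: in a zero-sum setting with r2 = -r1, a critic trained on the *sum* of rewards would see a constant 0 target and learn nothing useful. Illustrative numbers only:

```python
r1 = [1.0, -0.5, 2.0]
r2 = [-r for r in r1]  # competitive setting: r2 = -r1

# A "summed-reward" critic target would be identically zero.
summed = [a + b for a, b in zip(r1, r2)]
print(summed)  # [0.0, 0.0, 0.0]
```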

I rather believe the following, which I think is also reflected in section 3.3 (Eq. 4) of the paper from the last post: the centralized critic outputs values [v1, v2] for both agents, based on [a1, a2, o1, o2] for both agents. In the paper, they are not talking about summing the values, but rather merging the h_i into a "bold" h. From the notation, I would assume that h is a vector (i.e. h = [v1, v2]).

There are a few things you can do to confirm or deny this:

  1. Check the dimensionality of your centralized critic's outputs. If it's 2D, my assumption should be correct. But you can additionally try to:
  2. Actually check the value predictions that are saved in your batches. If they are different between the two agents, my assumption should be correct.
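As a rough illustration of check no. 2, assuming the collected batches behave like plain dicts with a "vf_preds" entry per agent (as RLlib's SampleBatch stores value predictions); the numbers are made up:

```python
batch_agent_1 = {"vf_preds": [-3.1, -2.0, -1.2]}
batch_agent_2 = {"vf_preds": [-6.0, -4.1, -2.3]}

# If the two agents' stored value predictions differ, each critic
# (head) is estimating that agent's *own* return rather than a
# single shared/summed value.
differ = batch_agent_1["vf_preds"] != batch_agent_2["vf_preds"]
print(differ)  # True
```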

Please let me know what you find out.


Hi @abrandenb, since every agent has its own centralized critic, the output shape is 1. I added the code from sven's CentralizedCritic example, which is available under rllib/examples/models/ on GitHub.

import torch
import torch.nn as nn
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.models.torch.fcnet import FullyConnectedNetwork as TorchFC
from ray.rllib.models.torch.misc import SlimFC


class TorchCentralizedCriticModel(TorchModelV2, nn.Module):
    """Multi-agent model that implements a centralized VF."""

    def __init__(self, obs_space, action_space, num_outputs, model_config,
                 name):
        TorchModelV2.__init__(self, obs_space, action_space, num_outputs,
                              model_config, name)
        nn.Module.__init__(self)

        # Base of the model
        self.model = TorchFC(obs_space, action_space, num_outputs,
                             model_config, name)

        # Central VF maps (obs, opp_obs, opp_act) -> vf_pred
        input_size = 6 + 6 + 2  # obs + opp_obs + opp_act
        self.central_vf = nn.Sequential(
            SlimFC(input_size, 16, activation_fn=nn.Tanh),
            SlimFC(16, 1),
        )

    def forward(self, input_dict, state, seq_lens):
        model_out, _ = self.model(input_dict, state, seq_lens)
        return model_out, []

    def central_value_function(self, obs, opponent_obs, opponent_actions):
        input_ = torch.cat([
            obs, opponent_obs,
            torch.nn.functional.one_hot(opponent_actions.long(), 2).float()
        ], 1)
        return torch.reshape(self.central_vf(input_), [-1])

Maybe @sven1977 could tell us whether the loss of a centralized critic is calculated differently from the loss of a critic in a normal actor-critic architecture. Or to simplify the question: is a centralized critic really just a critic with more observations and, beside that, completely equal to a "normal"/decentralized critic?

This discussion seems to be very similar to the one I started earlier this week. Maybe this is interesting for you, @CodingBurmer: Multi-Agent System for maximizing the overall reward of all agents?


thx for your reply @aronium. I think my problem is far more basic than yours, since I just want to know if a centralized critic is really just a critic with more observations.

@CodingBurmer it depends on which of the two centralized-critic examples you followed. If it was the second one, then the loss function is not changed. All you are doing there is adding extra information to the observation, but this information can be crucial. In your case described above it converts your problem from a partially observable stochastic game (POSG) to a decentralized Markov decision process (Dec-MDP) (with individual rewards). Usually Dec-MDPs have a shared reward, but in your setup each agent has an individual reward. This might have a more specific name, but I do not know it.

If you used the first example, then the loss is changed so that all of the agents share a centralized value function. This is one model that is learning to predict the value function for all agents that share it. In your case the value function would still be trying to predict an individual reward, not a shared reward. One thing you want to do in that case is make sure the observation includes an indicator of which agent is receiving the observation. The example environment TwoStepGame is doing that.

Here is the code that is modifying the loss function:
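As a rough, pure-Python sketch of what such a value-loss modification amounts to (the real example uses RLlib internals; `central_vf`, the inputs, and the numbers below are all illustrative stand-ins): the critic's regression target is still the individual agent's own return; only the critic's *input* is global.

```python
def value_loss(central_vf, obs, opp_obs, opp_act, returns):
    # Mean squared error between the central value prediction and the
    # agent's own discounted return (NOT the sum over agents).
    preds = [central_vf(o, oo, oa)
             for o, oo, oa in zip(obs, opp_obs, opp_act)]
    return sum((p - r) ** 2 for p, r in zip(preds, returns)) / len(returns)

# Dummy "network" that always predicts 0.0, just to show the call shape.
loss = value_loss(lambda o, oo, oa: 0.0,
                  obs=[[0, 0], [1, 0]], opp_obs=[[1, 0], [0, 0]],
                  opp_act=[1, 0], returns=[-12.0, -6.0])
print(loss)  # (144 + 36) / 2 = 90.0
```

Note that nothing in this loss couples the two agents' objectives; the coupling comes only from the critic seeing the full joint state.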
