hi guys, I just spent over two days trying to find out how a centralized critic leads MARL agents to a global optimum. So I thought, why not ask here and maybe help other MARL beginners too.
In my test environment there are two centralized-critic PPO agents. Each agent’s goal is to reach his “house”. The observation of each agent is his own x and y coordinates as well as the x and y coordinates of his goal. Agent 1 receives -1 reward and agent 2 receives -2 reward each time step. Once an agent stands inside his house he receives 0 reward each time step. The observation of the shared critic is the agent’s obs as well as the opponent agent’s obs and action (just like in the ray/rllib/examples/centralized_critic.py example).
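To make the setup concrete, here is a rough sketch of the architecture I mean. All names are made up and it is not the actual RLlib example code, just how I picture the actor/critic inputs:

```python
import torch
import torch.nn as nn


class CentralizedCriticModel(nn.Module):
    """Sketch of my setup: decentralized actor, centralized critic."""

    def __init__(self, obs_dim=4, opp_obs_dim=4, num_actions=5, hidden=64):
        super().__init__()
        # The actor only sees the agent's own obs: (x, y, goal_x, goal_y).
        self.actor = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, num_actions),
        )
        # The centralized critic additionally sees the opponent's obs and
        # (one-hot) action, like in centralized_critic.py.
        self.central_critic = nn.Sequential(
            nn.Linear(obs_dim + opp_obs_dim + num_actions, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def policy_logits(self, own_obs):
        return self.actor(own_obs)

    def central_value(self, own_obs, opp_obs, opp_action_onehot):
        joint = torch.cat([own_obs, opp_obs, opp_action_onehot], dim=-1)
        return self.central_critic(joint).squeeze(-1)
```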
For the global optimum, agent 1 has to wait until agent 2 reaches his house and then move through the corridor shortly after him. From reading various papers I think that centralized-critic PPO is capable of learning how to reach the global optimum, and some of my tests also show that it can, but I really don’t get how the centralized critic is doing this, because I thought in the end it’s just a critic with more observations.
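For reference, this is roughly how I understand the centralized value estimates get used in the PPO update (again just my own sketch with made-up names, only used during training while the actor still acts from local obs):

```python
import torch


def gae_advantages(rewards, central_values, gamma=0.99, lam=0.95):
    """GAE advantages computed from the centralized critic's values.

    rewards:        per-step rewards of one agent (length T)
    central_values: V(own_obs, opp_obs, opp_action) from the shared critic
    Assumes the episode terminates after the last step.
    """
    T = len(rewards)
    advantages = torch.zeros(T)
    last_gae = 0.0
    for t in reversed(range(T)):
        next_value = central_values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - central_values[t]
        last_gae = delta + gamma * lam * last_gae
        advantages[t] = last_gae
    return advantages
```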
I’m also thankful for any valuable resources.