Hello all, can you help me understand more intuitively how the multi-agent training process works? For example, in the case of multi-agent PPO, does the trainer minimize the loss function of each agent at the same time, or does it minimize the sum of the losses?
I understand how other algorithms like MADDPG share the critic, but I couldn't find documentation on how multi-agent PPO works.
Yes, RLlib has basically two different multi-agent approaches:
1. Specialized MA algos, such as QMIX and MADDPG, which train a centralized critic model and output actions as a single (Tuple) action.
2. Independent MA learning, where each policy you define gets updated separately on its own experience data from the environment. This is what happens for all other Trainers (not QMIX/MADDPG) whenever you specify the "multiagent" sub-config and provide one or more policies (with their classes, action/obs spaces, and config overrides), an agentID->policyID mapping fn, etc. (see the sketch below). In this case, yes, all the policies' losses are minimized separately. There is no shared loss, network, or anything else. Basically, each policy learns by itself and treats all the other agents as part of the environment.
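To make option 2 concrete, here is a minimal sketch of what such a "multiagent" sub-config could look like for PPO. It assumes the older config-dict API with classic gym spaces; the env name "my_multi_agent_env", the agent-ID prefixes, and the per-policy override are placeholders, and the exact policy_mapping_fn signature differs between RLlib versions:

```python
# Hedged sketch: independent-learning multi-agent PPO in RLlib
# (old config-dict API; details vary across RLlib versions).
import gym
from ray import tune

# Placeholder spaces -- in practice these come from your multi-agent env.
obs_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(4,))
act_space = gym.spaces.Discrete(2)

config = {
    "env": "my_multi_agent_env",  # hypothetical registered env name
    "multiagent": {
        # Each entry: (policy_cls or None to use the Trainer's default,
        #              obs_space, act_space, per-policy config overrides).
        "policies": {
            "policy_a": (None, obs_space, act_space, {}),
            "policy_b": (None, obs_space, act_space, {"gamma": 0.95}),
        },
        # Route each env agent ID to one of the policy IDs above.
        "policy_mapping_fn": lambda agent_id, *args, **kwargs: (
            "policy_a" if str(agent_id).startswith("a") else "policy_b"
        ),
    },
}

# Each policy is then updated independently on its own batch of experiences;
# PPO's loss is computed and minimized per policy, not summed across policies.
tune.run("PPO", config=config)
```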