Understanding QMIX

Hello!

I’m trying to understand how QMIX trains its policies. As far as I understand, the algorithm uses centralized training (through the mixing neural network) and decentralized execution.
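To make my current understanding concrete, here is a minimal NumPy sketch of the idea as I read it. It is not the library's implementation: the networks are untrained random linear maps, and every name and dimension (`agent_q_values`, `mix`, `EMBED`, and so on) is made up for illustration. Each agent computes Q-values from its local observation only, and the mixer combines the chosen Q-values into a joint value using non-negative weights generated from the global state.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative sizes only; none of these come from the library.
N_AGENTS, OBS_DIM, STATE_DIM, N_ACTIONS, EMBED = 2, 4, 6, 3, 8

# Shared per-agent Q-function (linear for brevity): local observation -> Q-value per action.
W_agent = rng.normal(size=(OBS_DIM, N_ACTIONS))

def agent_q_values(obs):
    return obs @ W_agent

def act(obs):
    """Decentralized execution: greedy action from the local observation alone."""
    return int(np.argmax(agent_q_values(obs)))

# Hypernetworks (linear here): global state -> weights and biases of the mixer.
H_w1 = rng.normal(size=(STATE_DIM, N_AGENTS * EMBED))
H_b1 = rng.normal(size=(STATE_DIM, EMBED))
H_w2 = rng.normal(size=(STATE_DIM, EMBED))
H_b2 = rng.normal(size=STATE_DIM)

def elu(x):
    # Monotonic non-linearity; the minimum() avoids overflow in the unused branch.
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)

def mix(chosen_qs, state):
    """Centralized training only: combine the agents' chosen Q-values into Q_tot."""
    w1 = np.abs(state @ H_w1).reshape(N_AGENTS, EMBED)  # abs() keeps mixing weights >= 0
    w2 = np.abs(state @ H_w2)
    hidden = elu(chosen_qs @ w1 + state @ H_b1)
    return float(hidden @ w2 + state @ H_b2)

observations = rng.normal(size=(N_AGENTS, OBS_DIM))
state = rng.normal(size=STATE_DIM)

actions = [act(o) for o in observations]
chosen_qs = np.array([agent_q_values(o)[a] for o, a in zip(observations, actions)])
q_tot = mix(chosen_qs, state)
print("greedy actions:", actions, "Q_tot:", q_tot)
```

In this sketch `act` never touches the state or the mixer. Because the mixing weights are non-negative, Q_tot cannot decrease when any single agent's chosen Q-value increases, so the per-agent greedy actions also maximize Q_tot.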

However, during training only one policy (default_policy) is saved in the checkpoints, which I take to be that of the mixing neural network. Is this correct? If so, how do I get the independent policy of each agent? I need this so that each agent can select its actions independently later, during evaluation.

I also understand that the observation for QMIX must be a tuple containing both the agent’s own observation and the complete state of the environment. (In the example provided with the library, the only difference between the two is that the agent’s observation includes the agent’s ID, i.e., the environment is fully observable.) Given that, is it possible to query the policy with only the agent’s observation, or must both the observation and the state be provided in order to compute an action?
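For reference, this is roughly how I picture the observation layout in that example. It is only a sketch of my reading: the key names `"obs"` and `"state"`, the sizes, and the helper `build_observations` are my own assumptions, not something I have confirmed against the library.

```python
import numpy as np

N_AGENTS, N_STATES = 2, 3  # illustrative sizes

def build_observations(state_index):
    """Return one entry per agent, grouped into a tuple."""
    state = np.eye(N_STATES)[state_index]  # one-hot global state
    per_agent = []
    for agent_id in range(N_AGENTS):
        agent_id_one_hot = np.eye(N_AGENTS)[agent_id]
        per_agent.append({
            # Agent's own view: here, the full state plus the agent's ID.
            "obs": np.concatenate([state, agent_id_one_hot]),
            # Global state, identical for every agent.
            "state": state,
        })
    return tuple(per_agent)

grouped_obs = build_observations(state_index=1)
for agent_id, entry in enumerate(grouped_obs):
    print(agent_id, entry["obs"], entry["state"])
```

If this layout is right, my question is whether the `"state"` part is consumed only by the mixer during training, or whether it also has to be supplied when computing an action.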

I have other questions, but these are the main ones. I’m trying to train a model with multiple agents (all homogeneous, as the library requires) and then integrate the pre-trained individual agents into other environments. Is this possible?

Thank you very much,
Germán