Hi @Aceticia,
Your idea is good. Any number of agents can share the same policy. Each agent will use the policy independently during execution (when sampling rollouts).
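For example, in RLlib this kind of sharing is usually expressed through the multi-agent config. Here is a minimal sketch, assuming a hypothetical registered environment name ("my_multi_agent_env"); note that the exact `policy_mapping_fn` signature varies across RLlib versions, so the permissive lambda below is just one way to write it:

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("my_multi_agent_env")  # hypothetical registered multi-agent env
    .multi_agent(
        # One shared policy; any number of agents can map to it.
        policies={"shared_policy"},
        # Every agent ID resolves to the same policy, so all agents
        # sample actions from (and train) the same network.
        policy_mapping_fn=lambda agent_id, *args, **kwargs: "shared_policy",
    )
)
# algo = config.build()  # would construct the Algorithm with the shared policy
```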
During training, if you do not add any centralizing pieces, such as a centralized critic or an algorithm like QMIX or MADDPG, then each transition is considered separately for each agent.
During the actual loss calculation, losses are computed in separate batches grouped by policy, not by agent. So if you have 3 agents that all map to the same policy, the transitions for each of those 3 agents will be combined into one batch for the loss calculation. Again, keep in mind that if it is not a multi-agent algorithm, the loss at each time step for each agent is computed independently, and they are all averaged at the end. A toy sketch of this batching logic is below.
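To make that concrete, here is a small self-contained sketch (not RLlib's internal code) of the batching logic described above: transitions from all agents that map to the same policy are pooled into one batch, and the independently computed per-timestep losses are averaged. The agent IDs, observations, and loss values are made up for illustration.

```python
import numpy as np

# Hypothetical rollout: (agent_id, observation, per-timestep loss) triples.
transitions = [
    ("agent_0", np.array([0.1, 0.2]), 0.30),
    ("agent_1", np.array([0.4, 0.5]), 0.10),
    ("agent_2", np.array([0.7, 0.8]), 0.20),
]

# All three agents map to the same policy.
policy_mapping = {
    "agent_0": "shared_policy",
    "agent_1": "shared_policy",
    "agent_2": "shared_policy",
}

# Group the per-timestep losses into one batch per policy (not per agent).
losses_by_policy = {}
for agent_id, obs, loss in transitions:
    losses_by_policy.setdefault(policy_mapping[agent_id], []).append(loss)

# Without a centralized critic or a multi-agent algorithm, each timestep's
# loss is independent and the batch loss is simply their mean.
for policy_id, losses in losses_by_policy.items():
    print(policy_id, np.mean(losses))  # shared_policy 0.2
```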
Policies are updated sequentially, one at a time, in a loop.