Hi, I’m working on a MARL env where virtual agents are added randomly at training time. More specifically, I have agents A1, A2, … Each agent has its own unshared model. How should I approach it if I plan to add virtual1_A1, which behaves independently of A1 but uses the same model as A1? This is a bit tricky: they share the same policy, but I need to make sure each one only sees its own hidden states.
Here’s my idea: since I don’t need to enumerate which agents will be in the environment, I can just specify in my policy_mapping_fn that all agents whose IDs end with A1 map to the same policy. That should ensure virtual1_A1 doesn’t share hidden states with A1. My concern is that this will probably cause the experience collected from A1 and virtual1_A1 to update their shared policy sequentially, since they are treated as different agents.
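For reference, here’s a rough sketch of the mapping I have in mind (assuming an RLlib-style multi-agent config; the exact policy_mapping_fn signature varies by version, and the agent-ID parsing is just illustrative):

```python
def policy_mapping_fn(agent_id, *args, **kwargs):
    # "A1", "virtual1_A1", "virtual2_A1", ... all map to the "A1" policy,
    # and likewise for A2, A3, etc. The token after the last "_" (or the
    # whole ID if there is no "_") names the base agent / policy.
    return agent_id.split("_")[-1]

config = {
    "multiagent": {
        # one unshared policy per base agent
        "policies": {"A1", "A2"},
        "policy_mapping_fn": policy_mapping_fn,
    },
}
```

As far as I understand, each agent ID then keeps its own RNN state during rollouts, while its transitions are attributed to the mapped policy for training.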
Should I worry about this sequential updating? Is there a way to merge the buffers at learning time, or bypass the splitting altogether?