If you need to keep the dynamics synchronous, with all agents stepping at the same time, then I think giving negative rewards for identical actions would be the best option. The open question is still which agent is allowed to decide first and therefore avoids the penalty.
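As a rough sketch of the penalty idea (not RLlib API, just plain Python you could call inside your env's `step()`): the function name `apply_collision_penalty`, the `priority` list, and the penalty value are my own assumptions for illustration. Here the "who decides first" question is settled by a fixed priority order, which is only one possible choice.

```python
from typing import Dict, Hashable, List


def apply_collision_penalty(
    actions: Dict[str, Hashable],
    rewards: Dict[str, float],
    priority: List[str],
    penalty: float = -1.0,
) -> Dict[str, float]:
    """Return a copy of `rewards` with a penalty added for duplicate actions.

    Agents are visited in `priority` order (hypothetical tie-break rule):
    the first agent to pick an action keeps it for free, and every later
    agent that picked the same action receives `penalty`.
    """
    claimed = set()
    shaped = dict(rewards)
    for agent_id in priority:
        action = actions[agent_id]
        if action in claimed:
            shaped[agent_id] += penalty  # duplicate of an earlier agent's action
        else:
            claimed.add(action)
    return shaped


# agent0 and agent1 both picked action 2; agent0 has priority, agent1 is penalized.
actions = {"agent0": 2, "agent1": 2, "agent2": 0}
rewards = {"agent0": 0.0, "agent1": 0.0, "agent2": 0.0}
shaped = apply_collision_penalty(actions, rewards, ["agent0", "agent1", "agent2"])
```

If you would rather not favor any agent, you could shuffle `priority` every step or penalize all agents involved in a collision, but that changes the learning signal.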
Otherwise, if you are willing to make your dynamics sequential, action masking may also help: agent0 picks a0, then agent1 receives the corresponding action mask in its observation (provided by the env) and uses it to sample any action except a0. This would be similar to @RickLan's postprocess suggestion. I think each of these approaches has its advantages and disadvantages.
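To make the sequential variant concrete, here is a minimal sketch of masked sampling using only the standard library. The helpers `build_action_mask` and `sample_masked` are hypothetical names of mine, not RLlib functions; in practice the env would put the mask into the observation and the model would apply it to its logits.

```python
import math
import random
from typing import Iterable, List, Sequence


def build_action_mask(num_actions: int, taken: Iterable[int]) -> List[int]:
    """1 = action still allowed, 0 = already taken by an earlier agent."""
    mask = [1] * num_actions
    for action in taken:
        mask[action] = 0
    return mask


def sample_masked(
    logits: Sequence[float], mask: Sequence[int], rng: random.Random
) -> int:
    """Sample from a softmax over `logits`, restricted to unmasked actions."""
    allowed = [i for i, m in enumerate(mask) if m]
    if not allowed:
        raise ValueError("mask excludes every action")
    top = max(logits[i] for i in allowed)  # subtract max for numerical stability
    weights = [math.exp(logits[i] - top) for i in allowed]
    return rng.choices(allowed, weights=weights, k=1)[0]


rng = random.Random(0)
logits = [0.1, 0.5, 0.2, 0.2]  # made-up policy outputs for 4 actions

# agent0 acts first with nothing masked, agent1 then samples anything but a0.
a0 = sample_masked(logits, build_action_mask(4, []), rng)
a1 = sample_masked(logits, build_action_mask(4, [a0]), rng)
```

The downside compared to the penalty approach is that the agents no longer act simultaneously, so agent1's observation depends on agent0's choice within the same step.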