MARL modeling issue

Hello community,

I’m not sure how to handle the following issue I have:
In my MARL use case, several ready agents can choose the identical action at the same time, but this action can actually only be executed by one agent, not by two or more agents simultaneously!

So far, I’m not sure how best to handle such situations. Ideas I’ve thought about are:

  • simply prohibit the identical action from being executed by more than one agent at the same time and “reward” the offending agents with some penalty (hoping the agents will learn to avoid it; see the sketch after this list)

  • maybe use some kind of conditional action distribution (comparable to this pattern)

  • alternatively, break up such situations with multiple ready agents and artificially process each ready agent one after the other in separate sub-timesteps
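
For the first idea, this is roughly how I would implement the penalty inside the environment’s step() (just a sketch in the old MultiAgentEnv API; NOOP_ACTION, _execute(), _observe(), _episode_over() and the penalty value are placeholders of mine):

```python
from collections import Counter

def step(self, action_dict):
    rewards = {agent_id: 0.0 for agent_id in action_dict}

    # Count how often each exclusive (non-"no-op") action was chosen.
    counts = Counter(a for a in action_dict.values() if a != self.NOOP_ACTION)

    for agent_id, action in action_dict.items():
        if action != self.NOOP_ACTION and counts[action] > 1:
            # Conflict: several agents picked the same exclusive action.
            # None of them gets to execute it; all of them get a penalty.
            rewards[agent_id] -= 1.0  # penalty magnitude is a guess
        else:
            rewards[agent_id] += self._execute(agent_id, action)

    obs = {agent_id: self._observe(agent_id) for agent_id in action_dict}
    dones = {"__all__": self._episode_over()}
    return obs, rewards, dones, {}
```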

I would be really glad about any suggestions; please let me know how you handle such situations.

@sven1977 please take a look!

Have you thought about using the on_postprocess_trajectory() callback to mutate the collected samples? For example, if you add “no-op” support to your environment, then in that callback function you could replace identical actions between the agents with “no-op”.

https://docs.ray.io/en/master/rllib-training.html?highlight=on_postprocess_trajectory#callbacks-and-custom-metrics
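
Roughly something like this (untested sketch; the import path depends on your RLlib version, and NOOP as well as the assumption that all agents’ batches share the same “t” column are just my guesses):

```python
from ray.rllib.agents.callbacks import DefaultCallbacks

NOOP = 0  # placeholder: id of your "no-op"/"wait" action


class NoOpOnConflict(DefaultCallbacks):
    def on_postprocess_trajectory(
        self, *, worker, episode, agent_id, policy_id, policies,
        postprocessed_batch, original_batches, **kwargs
    ):
        my_actions = postprocessed_batch["actions"]
        my_t = postprocessed_batch["t"]

        for other_id, (_, other_batch) in original_batches.items():
            if other_id == agent_id:
                continue
            # Align the rows of the two agents by env timestep.
            other = dict(zip(other_batch["t"], other_batch["actions"]))
            for i, (t, a) in enumerate(zip(my_t, my_actions)):
                if a != NOOP and other.get(t) == a:
                    # Same action at the same timestep -> train as if this
                    # agent had chosen "no-op" instead (in-place overwrite).
                    my_actions[i] = NOOP
```

You would register it via config["callbacks"] = NoOpOnConflict. Note that this only changes what the policies train on, not what was actually executed in the env.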

@RickLan That sounds interesting. My agents already have the option to take a “no-op” (“wait”) action, but I haven’t thought about using the on_postprocess_trajectory() callback yet.
If I understand you correctly, I should replace the identical action with “no-op” for all agents that took it. But couldn’t this lead to the agents learning to always take “no-op” in such situations? :thinking: And how do you think feedback should be given to the agents via rewards (punish them?)?

Rewarding or punishing also works. I see the mutation as a way to modify the exploration of the policy. If the agents must learn not to output identical actions, then a reward/punishment is probably needed.


@RickLan thanks for your thoughts! I’ll give it a try.

Additionally, my most recent idea is to set up some kind of hierarchical RL with an additional agent (“supervisor”) at an upper (or lower) level that decides which of the ready agents may decide on its next action first. Then the agents couldn’t choose an identical action at the same time.
Does anyone have experience with something similar?
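
To make it a bit more concrete, this is very roughly what I have in mind (old MultiAgentEnv API; “supervisor”, ready_agents, _execute(), _observe() and _episode_over() are placeholders, and I rely on RLlib only querying the agents whose ids appear in the returned obs dict for their next action):

```python
def step(self, action_dict):
    if "supervisor" in action_dict:
        # The supervisor's action is the index of the ready agent that
        # may decide on its next action first.
        chosen = self.ready_agents[action_dict["supervisor"]]
        return {chosen: self._observe(chosen)}, {}, {"__all__": False}, {}

    # Exactly one "worker" agent acted; execute its action, then hand
    # control back to the supervisor for the next decision.
    (agent_id, action), = action_dict.items()
    reward = self._execute(agent_id, action)
    obs = {"supervisor": self._observe("supervisor")}
    return obs, {agent_id: reward}, {"__all__": self._episode_over()}, {}
```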


I think if you need to keep the synchronous nature of your agents stepping at the same time, then providing negative rewards for identical actions would be best (the open question is still which agent is allowed to decide first and thus won’t get the penalty).
Otherwise, in case you would like to change your dynamics to be sequential, action masking may also help (agent0 picks a0; agent1 gets the respective action mask in its observation (provided by the env) and uses it to sample anything but a0). This would be similar to @RickLan’s postprocess suggestion. I think each of these approaches has its advantages and disadvantages.
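
A sketch of the masking part (basically the pattern from RLlib’s action-masking examples, PyTorch version; the Dict obs keys “action_mask”/“real_obs” are just example names your env would have to provide):

```python
import torch
import torch.nn as nn
from ray.rllib.models.torch.fcnet import FullyConnectedNetwork
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2


class MaskedActionsModel(TorchModelV2, nn.Module):
    """Applies the env-provided action mask to the policy logits."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        TorchModelV2.__init__(self, obs_space, action_space, num_outputs,
                              model_config, name)
        nn.Module.__init__(self)
        # The inner model only sees the "real_obs" part of the Dict obs.
        self.inner = FullyConnectedNetwork(
            obs_space.original_space["real_obs"], action_space,
            num_outputs, model_config, name + "_inner")

    def forward(self, input_dict, state, seq_lens):
        logits, _ = self.inner({"obs": input_dict["obs"]["real_obs"]})
        mask = input_dict["obs"]["action_mask"]
        # Invalid actions (mask == 0) get a very large negative logit, so
        # they can never be sampled (e.g. agent1 cannot pick agent0's a0).
        inf_mask = torch.clamp(torch.log(mask), min=-1e10)
        return logits + inf_mask, state

    def value_function(self):
        return self.inner.value_function()
```

You would register it with ModelCatalog.register_custom_model("masked_fcnet", MaskedActionsModel) and point config["model"]["custom_model"] at it.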


To answer the open question of which agent is allowed to decide first, I thought of using that additional agent (“supervisor”) to break the synchronous nature of my agents stepping at the same time. Unfortunately, it’s hard to know which is the right way to go.