MARL modeling issue

klausk55 · March 26, 2021, 11:01am

Hello community,

I’m not sure how to handle the following issue:
In my MARL use case, several ready agents can choose the identical action at the same time, but this action can actually only be executed by one agent, not by two or more agents at once!

So far, I’m not sure how best to handle such situations. Ideas I’ve thought about are:

- simply prohibit the agents from taking the identical action at the same time and “reward” the offending agents with a penalty (hoping the agents will learn to avoid it)
- maybe use some kind of conditional action distribution (comparable to this pattern)
- alternatively, break up situations in which several agents are ready at the same time and artificially process each ready agent successively in marginal timesteps

I would be really glad about any suggestions, or please let me know how you handle it.

eoakes · March 26, 2021, 1:21pm

@sven1977 please take a look!

RickLan · March 29, 2021, 9:22am

Have you thought about using the on_postprocess_trajectory() callback to mutate the collected samples? For example, if you add support for a “no-op” action in your environment, then in that callback function you could replace identical actions between the agents with “no-op”.

https://docs.ray.io/en/master/rllib-training.html?highlight=on_postprocess_trajectory#callbacks-and-custom-metrics
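
A minimal sketch of the replacement step itself, in plain Python and independent of RLlib. The function name `mask_conflicts`, the `NO_OP` index, and the dict-of-actions layout are assumptions for illustration. Wiring this into on_postprocess_trajectory() (aligning the agents' batches by timestep and writing the actions back) is not shown.

```python
from collections import Counter

NO_OP = 0  # assumption: index of the "no-op"/"wait" action in the env


def mask_conflicts(actions, no_op=NO_OP):
    """Replace every action chosen by more than one agent with no_op.

    actions: dict mapping agent id -> discrete action for one timestep.
    """
    # Count how often each real (non-no-op) action was chosen.
    counts = Counter(a for a in actions.values() if a != no_op)
    return {
        agent: (no_op if counts[act] > 1 else act)
        for agent, act in actions.items()
    }


joint = {"agent_0": 3, "agent_1": 3, "agent_2": 5, "agent_3": 0}
print(mask_conflicts(joint))
# {'agent_0': 0, 'agent_1': 0, 'agent_2': 5, 'agent_3': 0}
```

Note that this variant turns the action into “no-op” for every agent involved in a conflict, so nobody executes the contested action in that step.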

klausk55 · March 29, 2021, 9:51am

@RickLan That sounds interesting. My agents already have the chance to take a “no-op” (“wait”) action, but I haven’t thought about using this on_postprocess_trajectory() callback yet.
If I understand you correctly, I should replace the identical action with “no-op” for all agents that took that action. But might this lead the agents to learn to always take “no-op” in such situations? And what do you think about giving feedback to the agents via rewards (punishing them)?

RickLan · March 29, 2021, 10:11am

Rewarding or punishing also works. I see the mutation as a way to modify the policy’s exploration. If the agents must learn not to output identical actions, then reward/punishment is probably needed.
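
As a rough sketch of the reward/punish variant: the environment could let one agent per contested action go ahead and give the others a “no-op” plus a penalty. Everything here is an assumption for illustration (the name `resolve_with_penalty`, the penalty value, and picking the winner at random); it is plain Python and not tied to any RLlib API.

```python
import random

CONFLICT_PENALTY = -1.0  # assumption: the magnitude would need tuning


def resolve_with_penalty(actions, no_op=0, rng=random):
    """Let one agent per contested action execute it; penalize the rest.

    Returns (executed_actions, extra_rewards), both keyed by agent id.
    """
    # Group agents by the action they proposed.
    by_action = {}
    for agent, act in actions.items():
        by_action.setdefault(act, []).append(agent)

    executed, extra = {}, {}
    for act, agents in by_action.items():
        # The no-op action is never contested, so it needs no winner.
        winner = rng.choice(agents) if act != no_op else None
        for agent in agents:
            if act == no_op or agent == winner:
                executed[agent], extra[agent] = act, 0.0
            else:
                executed[agent], extra[agent] = no_op, CONFLICT_PENALTY
    return executed, extra


executed, extra = resolve_with_penalty({"agent_0": 2, "agent_1": 2, "agent_2": 1})
print(executed, extra)
```

The extra rewards would be added to whatever the environment returns for that step. Penalizing all agents involved instead of sparing a random winner is an equally simple alternative.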

klausk55 · March 31, 2021, 11:25am

@RickLan thanks for your thoughts! I’ll give it a try.

Additionally, my most recent idea is to set up some kind of hierarchical RL with a further agent (“supervisor”) at an upper (or lower) level that decides which of the ready agents may choose its next action first. Then agents couldn’t choose an identical action at the same time.
Does anyone have experience with something similar?
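
A simplified sketch of how a supervisor’s ranking could be used, assuming the supervisor outputs one priority score per ready agent. In this synchronous variant all agents still propose at once, and the ranking only decides who keeps a contested action; the name `resolve_by_priority` and the score dict are hypothetical.

```python
def resolve_by_priority(actions, priority, no_op=0):
    """Grant each contested action to the agent the supervisor ranked highest.

    actions:  dict agent id -> proposed discrete action
    priority: dict agent id -> score from a (hypothetical) supervisor policy
    """
    executed, taken = {}, set()
    # Visit agents from highest to lowest priority; ties break on agent id.
    for agent in sorted(actions, key=lambda a: (-priority[a], a)):
        act = actions[agent]
        if act != no_op and act in taken:
            executed[agent] = no_op  # already claimed by a higher-ranked agent
        else:
            executed[agent] = act
            taken.add(act)
    return executed


proposed = {"agent_0": 2, "agent_1": 2, "agent_2": 1}
scores = {"agent_0": 0.1, "agent_1": 0.9, "agent_2": 0.5}
print(resolve_by_priority(proposed, scores))
# {'agent_1': 2, 'agent_2': 1, 'agent_0': 0}
```

A fully hierarchical setup would instead let the lower-ranked agents pick again after seeing what has been taken, which changes the stepping from synchronous to sequential.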

sven1977 · March 31, 2021, 12:54pm

I think if you need to keep the synchronous nature of your agents stepping at the same time, then providing negative rewards for identical actions would be best (the question remains which agent is allowed to decide first and won’t get the penalty).
Otherwise, in case you would like to change your dynamics to be sequential, action masking may also help (agent0 picks a0; agent1 gets the respective action mask in its observation (provided by the env) and uses it to sample any action except a0). This would be similar to @RickLan’s postprocess suggestion. I think each of these approaches has its advantages and disadvantages.
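
The sequential masking idea could look roughly like this in plain Python. The names `sample_sequentially` and `random_policy` are placeholders, the random policy stands in for a trained one, and keeping “no-op” always available is an assumption.

```python
import random


def sample_sequentially(agent_order, num_actions, policy_sample, no_op=0):
    """Agents pick one after another; taken actions are masked for later agents.

    policy_sample(agent, mask) is a hypothetical callable returning an action
    whose mask entry is 1. The no-op action is never masked out.
    """
    mask = [1] * num_actions
    chosen = {}
    for agent in agent_order:
        action = policy_sample(agent, list(mask))
        if mask[action] == 0:
            raise ValueError(f"{agent} picked masked action {action}")
        chosen[agent] = action
        if action != no_op:
            mask[action] = 0  # later agents can no longer pick this action
    return chosen


def random_policy(agent, mask):
    # Stand-in for a trained policy: uniform over the allowed actions.
    return random.choice([a for a, ok in enumerate(mask) if ok])


print(sample_sequentially(["agent_0", "agent_1", "agent_2"], 4, random_policy))
```

In an actual environment the mask would be part of each agent’s observation, and the policy would have to respect it when sampling; the order of the agents still has to come from somewhere.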

klausk55 · March 31, 2021, 1:23pm

To answer the open question of “which agent is allowed to decide first”, I thought of using that further agent (“supervisor”) to break the synchronous nature of my agents stepping at the same time. Unfortunately, it’s hard to know the right way to go.
