Hello!
Has anyone ever tried to use the reward function as a way to add prioritization of specific agents in a MARL problem?
E.g., I managed to train the 4 agents in the following environment with a customized version of RLlib’s multi-agent PPO example, which uses a centralized critic (ray/centralized_critic.py at master · ray-project/ray · GitHub):
For each timestep an agent is not in its final position (the box coloured the same way as the agent), it receives a reward of -1. The agents learned to use the corridor one after another and solved the problem as fast as possible.
My goal now is to make a specific agent cross the corridor first.
My first thought was to multiply this agent’s negative reward by a factor (I ran some experiments with factors of 5, 10, …), so that it has a bigger impact on the total reward and the other agents learn to give it priority.
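Roughly, the scaling I tried happens on the env’s reward dict, something like this (simplified sketch; the agent id and factor are placeholders):

PRIORITY_AGENT = "agent_0"   # the agent that should cross the corridor first
PRIORITY_FACTOR = 10.0       # I tried 5, 10, ...

def scale_rewards(rewards):
    # Multiply the prioritized agent's (negative) step reward so that wasting
    # its time hurts the joint outcome more than wasting the other agents' time.
    scaled = dict(rewards)
    if PRIORITY_AGENT in scaled:
        scaled[PRIORITY_AGENT] *= PRIORITY_FACTOR
    return scaled

# called on the reward dict inside the env's step() before returning it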
So far I have not been able to produce the desired result, and I couldn’t find any examples or papers dealing with a similar problem.
Has anyone dealt with this kind of problem, or does anyone have another idea for bringing priority into such a MARL problem?
Are you training a single policy, or does each agent train its own policy?
I would try changing the rewards for the OTHER agents so that they are penalized for going out of order. That way, they can learn not to go UNTIL the first agent has already reached its destination (or some other way you can think of to represent that the first agent has gone first).
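For example, something along these lines (just a rough sketch; the names and the penalty value are made up and would need to be adapted to your env):

ORDER_PENALTY = -5.0  # extra penalty for moving before the prioritized agent is done

def shape_rewards(rewards, moved_agents, priority_agent, priority_done):
    # Penalize every non-prioritized agent that moved while the prioritized
    # agent has not yet reached its goal cell.
    shaped = dict(rewards)
    if not priority_done:
        for agent_id in moved_agents:
            if agent_id != priority_agent:
                shaped[agent_id] += ORDER_PENALTY
    return shaped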
This is a very interesting task, thanks for sharing about it!
Hey @rusu24edward , thank you for your input!
Each agent is training its own policy.
Currently I’m testing several different ways of giving the rewards.
I’ll keep you updated!
I got a follow-up question on this topic, maybe @sven1977 can help here:
Using the RLlib MultiAgentEnv interface, the actions are handed to the environment via an action dictionary that looks like this (let’s assume each agent gives an action at every timestep, and the env is reset after a maximum number of timesteps):
{"agent_0": 0,
"agent_1": 2,
"agent_2": 1,
"agent_3": 0
}
If I iterate through this dictionary to perform the env logic corresponding to each agent’s action, using something like this:
for agent_i, action_i in actions.items():
    # perform the env logic: move agent_i in the direction selected by action_i
Then agent_0 would have a kind of advantage, because it would always be the first agent whose action is performed, and since only one agent can occupy a grid cell at a given timestep, it would always be given “priority”, wouldn’t it?
A fair solution I could think of would be to shuffle the action dictionary each step before iterating through it.
On the other hand, I could exploit this circumstance and sort the dictionary according to my desired prioritization of the agents.
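Concretely, I’m thinking of something along these lines (just a sketch; the priority order is an arbitrary example):

import random

PRIORITY_ORDER = ["agent_2", "agent_0", "agent_1", "agent_3"]  # example ordering

def iterate_actions(actions, shuffle=True):
    # Yield (agent_id, action) pairs either in random order (fair)
    # or sorted by the fixed priority list above.
    agent_ids = list(actions)
    if shuffle:
        random.shuffle(agent_ids)
    else:
        agent_ids.sort(key=PRIORITY_ORDER.index)
    for agent_id in agent_ids:
        yield agent_id, actions[agent_id]

# inside step():
# for agent_i, action_i in iterate_actions(actions, shuffle=True):
#     move agent_i as before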
Do I get this right?
If you use the environment to sort the dictionary according to the desired priority, then the agents may not actually have learned priority. They could be learning the same policy they learn now and just happen to get the right order because the agent you want to finish first always goes first. For example, with default sorting, agent_0 will always go first. If that’s who you want to go first, then it will look like you’ve learned to prioritize agent_0, but it’s actually just an artifact of the environment. If you really want to say that your agents learned priority, you should make the prioritized agent go last, to ensure that the others have really learned to wait for it to move.
The random ordering you suggest for fairness makes sense to me, although if the policy you train is general enough, it shouldn’t make a difference. You could also try a turn-based approach, where each agent explicitly has a turn (and thus receives the most up-to-date observation of the environment right before it acts).
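A rough outline of such a turn-based env with RLlib’s MultiAgentEnv interface could look like this (dummy observations and rewards just to show the turn logic, using the old 4-tuple step API; you’d plug in your own grid logic):

import numpy as np
from ray.rllib.env.multi_agent_env import MultiAgentEnv

class TurnBasedCorridorEnv(MultiAgentEnv):
    # Only the agent whose turn it is receives an observation, so it always
    # acts on the most recent state of the grid.
    def __init__(self, config=None):
        super().__init__()
        self.agent_ids = ["agent_0", "agent_1", "agent_2", "agent_3"]
        self.turn = 0
        self.t = 0

    def reset(self):
        self.turn, self.t = 0, 0
        # return an observation only for the agent that moves first
        return {self.agent_ids[self.turn]: np.zeros(4, dtype=np.float32)}  # dummy obs

    def step(self, action_dict):
        acting = self.agent_ids[self.turn]
        # ...apply action_dict[acting] to the grid here...
        self.turn = (self.turn + 1) % len(self.agent_ids)
        self.t += 1
        obs = {self.agent_ids[self.turn]: np.zeros(4, dtype=np.float32)}  # dummy obs
        rewards = {acting: -1.0}            # -1 per step until the goal is reached
        dones = {"__all__": self.t >= 100}  # episode cap, as in the original env
        return obs, rewards, dones, {}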
I’m interested to see how this turns out, please keep me posted!
Thanks for your input on this, @rusu24edward . I agree with everything you said.
Yes, having the env always process agent_0’s action choice first would add a strong bias.
Solutions:
- Randomize the env’s for loop over the agents.
- Turn-based: However, if the order is always the same, this would still mean that agent_0 has the “advantage” of being the start player, which could mean that it makes sense for this agent to enter the tunnel first.
- Also try: Penalizing the prioritized agent more (for losing time) than the others.
@korbinian-hoermann could you share your successful implementation of the centralized-critic approach for PPO with more than two agents?