How severely does this issue affect your experience of using Ray?
Medium: It causes significant difficulty in completing my task, but I can work around it.
Hello!
I have designed a cooperative environment and I am having trouble assigning rewards to agents so as to reinforce cooperative behaviors.
The rewards I am assigning are separate for each agent using the rewards dictionary.
More specifically, I have a time penalty for each step an agent takes, a small positive reward when an agent takes a “correct” action, and, finally, when the team reaches its goal, I give both agents a +1 reward and end the episode.
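For reference, here is a minimal sketch of how I build the rewards dict; the constants and the correct_actions / reached_goal inputs are placeholders for my actual environment logic:

```python
# A minimal sketch of the per-agent reward dict described above; the constants
# and the correct_actions / reached_goal inputs are placeholders.
TIME_PENALTY = -0.01          # placeholder value
CORRECT_ACTION_BONUS = 0.05   # placeholder value
TEAM_GOAL_REWARD = 1.0

def compute_rewards(agent_ids, correct_actions, reached_goal):
    """Build the {agent_id: reward} dict returned from MultiAgentEnv.step().

    correct_actions: dict mapping agent_id -> bool ("took a correct action")
    reached_goal: True when the team reached its goal on this step
    """
    rewards = {}
    for agent_id in agent_ids:
        reward = TIME_PENALTY
        if correct_actions.get(agent_id, False):
            reward += CORRECT_ACTION_BONUS
        if reached_goal:
            # Shared terminal reward; the episode is also ended via the dones dict.
            reward += TEAM_GOAL_REWARD
        rewards[agent_id] = reward
    return rewards
```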
After training for at least 1M steps, although the agents learn the policy and reach their goals, they don’t actually cooperate with each other.
Looking at the episode_reward_mean reported at the end of the episodes (which shows the total reward across all agents of the team rather than the team’s mean reward), I suspect I am not assigning rewards in a way that reinforces cooperative behaviors.
Is there a specific way to assign rewards to reinforce cooperative behaviors?
Or in other words, how do you assign rewards so that the agents maximize their overall reward?
Tune will report the sum of rewards as the overall reward and will use it to score trials for checkpointing and so on. This should not be confused with the way your policies are optimized, which is dependent on how you shape your reward. In your case, it is up to you to design your environment in a way that rewards collaboration. If you want agents to “help” each other, the rewards of the “beneficiary agent” must be included in the rewards of the “helping agent”.
A simple way to do this, if you only have two agents, is to postprocess episodes so that each agent’s reward also includes, for example, a fraction of the other agent’s reward.
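For example, a rough sketch using RLlib’s on_postprocess_trajectory callback hook, assuming two agents that both step on every timestep (the mixing fraction and class name are just placeholders):

```python
import numpy as np
from ray.rllib.algorithms.callbacks import DefaultCallbacks
# (on older Ray versions the import is ray.rllib.agents.callbacks.DefaultCallbacks)

MIX_FRACTION = 0.25  # example value: fraction of the other agent's reward to add

class ShareRewardCallbacks(DefaultCallbacks):
    def on_postprocess_trajectory(
        self, *, worker, episode, agent_id, policy_id, policies,
        postprocessed_batch, original_batches, **kwargs,
    ):
        # original_batches maps agent_id -> (policy, SampleBatch).
        for other_id, (_, other_batch) in original_batches.items():
            if other_id == agent_id:
                continue
            other_rewards = np.asarray(other_batch["rewards"])
            n = min(len(postprocessed_batch["rewards"]), len(other_rewards))
            # Mix a fraction of the other agent's rewards into this agent's.
            postprocessed_batch["rewards"][:n] += MIX_FRACTION * other_rewards[:n]
            # Caveat: for PPO the advantages were already computed from the
            # original rewards by this point, so either recompute them here or
            # do the mixing inside the environment instead.

# Enable with: config["callbacks"] = ShareRewardCallbacks
```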
I see, thanks!
I was worried I had to use something like a cumulative reward for both agents.
@rusu24edward
Both agents share the same policy.
It is basically a field-coverage environment, where two agents start from the bottom-right corner, each with a specific area of coverage.
On top of the rewards I mentioned, they take a penalty depending on how much their areas of effect overlap, but it seems that they are learning slowly.
Well, if by cumulative reward you mean rewarding both agents with the sum of their rewards, that’s one version where the other agent’s objective is “worth” just as much. Maybe you can scale this weight, similar to a learning-rate schedule, so that agents “mainly” learn to maximize their own rewards first before fine-tuning to help the other agent as well. It very much depends on your environment!
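As a sketch of what I mean by scaling, you could anneal the mixing weight over training, loosely like a learning-rate schedule (the horizon and final value below are arbitrary):

```python
# Sketch: ramp the reward-mixing weight from 0 up to MIX_END over ANNEAL_STEPS
# timesteps, so each agent first learns to maximize its own reward and only
# later starts "caring" about the other agent. Numbers are arbitrary examples.
ANNEAL_STEPS = 2_000_000
MIX_END = 0.5

def mixing_weight(global_timestep: int) -> float:
    return MIX_END * min(global_timestep / ANNEAL_STEPS, 1.0)
```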
I am only giving them the extra +1 reward if they succeed (or -1 if they fail).
I will definitely try what you suggested.
As for the scaling, that’s basically what I am doing through a curriculum learning approach.
Initially, it’s just about covering the area (with a larger area of effect).
In later stages, the area of effect gets smaller and the agents take the negative reward when their areas overlap.
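Roughly, the stages look like this (a simplified sketch; the radii, flags, and attribute names are placeholders for my actual settings):

```python
# Simplified sketch of the curriculum stages (values are placeholders).
CURRICULUM = [
    # (area-of-effect radius, overlap penalty enabled)
    (8, False),  # stage 0: large area, just learn to cover the field
    (5, False),  # stage 1: smaller area of effect
    (3, True),   # stage 2: small area, penalize overlapping coverage
]

def apply_stage(env, stage: int):
    radius, penalize_overlap = CURRICULUM[stage]
    env.coverage_radius = radius             # hypothetical env attribute
    env.penalize_overlap = penalize_overlap  # hypothetical env attribute
```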
Are the other agents included in each other’s observations?
Are you adding an agent id to the observations to help the policy know which agent it is selecting actions for?
What algorithm are you using? If it is an actor-critic algorithm, are you using a centralized critic that determines the value based on the joint observations of all the agents?
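For instance, a minimal sketch of attaching a one-hot agent id to each agent’s observation via a Dict space (names here are illustrative, not from your code):

```python
import numpy as np
from gymnasium import spaces  # on older Ray versions: from gym import spaces

NUM_AGENTS = 2

def make_obs_space(base_space):
    # Wrap the per-agent observation space so it also carries a one-hot agent
    # id, letting a shared policy tell the agents apart.
    return spaces.Dict({
        "obs": base_space,
        "agent_id": spaces.Box(0.0, 1.0, shape=(NUM_AGENTS,), dtype=np.float32),
    })

def add_agent_id(obs, agent_index):
    one_hot = np.zeros(NUM_AGENTS, dtype=np.float32)
    one_hot[agent_index] = 1.0
    return {"obs": obs, "agent_id": one_hot}
```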
I am including the other agents’ positions, and I return the observations as a dictionary with the agent ids as keys and the observations as values.
I’ve been using PPO.
I recently tried SAC, A3C, and IMPALA, and I ran into some out-of-memory problems with those.
A (probably) important detail about the approach is that I am using an 84x84x4 binary image as an observation, with each channel containing the following information (see the sketch after this list):
Remaining area to be covered
The obstacles that exist in the environment
The current agent’s position
The other agent’s position
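In code, the per-agent observation space looks roughly like this (a sketch; how I pack the channels is omitted):

```python
import numpy as np
from gymnasium import spaces  # on older Ray versions: from gym import spaces

# 84x84 binary image with 4 channels: remaining area, obstacles,
# own position, other agent's position.
OBS_SPACE = spaces.Box(low=0.0, high=1.0, shape=(84, 84, 4), dtype=np.float32)
```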
I am assuming that this is why I am having the OOM issues with the algorithms mentioned before. The following config values are most likely the cause of the OOM, but I am not sure whether they should be lower. The values are “inspired” by other examples I saw.