How to distribute the final reward among agents in a fully cooperative turn-taking environment?

Hi all,
I developed a cooperative multi-agent env with several agents, each of which has its own local goal, but an episode terminates when the global goal is achieved. The env is turn-taking, meaning that the agents do not act simultaneously but one after another.

I am using RLlib, but I do not know how to share the global reward between agents in this setting.

This is how my env works in RLlib:

For simplicity, let’s assume I have only two agents.
First, agent_1 takes an action. I send action={"agent_1": 'left'} to the env.
Then the env returns obs={"agent_2": some_arr}, done={"__all__": False}, and reward={"agent_1": -1}. The obs dict contains only the key for agent_2, because it is agent_2's turn next.

Similarly, agent_2 takes an action. I send action={"agent_2": 'up'} to the env.
Then the env returns obs={"agent_1": some_new_arr}, done={"__all__": False}, and reward={"agent_2": -1}. The obs dict contains only the key for agent_1, because it is agent_1's turn next.
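
To make the exchange above concrete, here is a minimal sketch of the turn-taking pattern in plain Python. It does not import RLlib and is not my real env: the class name TurnTakingEnv, the goal_steps parameter, and the dummy observation are all made up for illustration. It only mimics the dict-based interface described above.

```python
# Hypothetical toy env: names and dynamics are illustrative only, not RLlib code.
class TurnTakingEnv:
    def __init__(self, agents=("agent_1", "agent_2"), goal_steps=4):
        self.agents = list(agents)
        # Assumption: the global goal counts as reached after goal_steps moves.
        self.goal_steps = goal_steps
        self.t = 0

    def _current_agent(self):
        # Agents act strictly one after another, in a fixed order.
        return self.agents[self.t % len(self.agents)]

    def reset(self):
        self.t = 0
        # Only the agent whose turn it is receives an observation.
        return {self._current_agent(): [self.t]}

    def step(self, action_dict):
        # Exactly one agent acts per step.
        (actor, _action), = action_dict.items()
        assert actor == self._current_agent(), "it is not this agent's turn"
        self.t += 1
        obs = {self._current_agent(): [self.t]}  # observation for the next agent
        reward = {actor: -1}                     # local reward for the agent that just acted
        done = {"__all__": self.t >= self.goal_steps}
        return obs, reward, done, {}
```

Calling env.step({"agent_1": 'left'}) right after reset() returns an obs dict keyed by agent_2 and a reward dict keyed by agent_1, which is the pattern described above.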

Now assume that, after some time steps, it is agent_2's turn again. It takes an action, and this time the global objective is satisfied.

So my question is: how do I give the agents the global reward?
For agent_2 this is easy: the env can send reward={"agent_2": GLOBAL_REWARD} to agent_2, which just took the action.
The problem is that agent_1 took its action in the previous time step, and the env already sent it its local reward at that point. So how can the env send GLOBAL_REWARD to agent_1?

I could not find any similar example in RLlib.

One solution that comes to mind:
Although the episode has effectively ended, I run it for one more time step, force agent_1 to take a no_action action, and have the env give it GLOBAL_REWARD. Only then do I really terminate the episode.

But I am not sure whether this solution is sound.

I would be happy if you could share your ideas here.


Hi @deepgravity, it sounds like you have control over your environment design. On the last step, when the global goal is reached, you can just output the reward for both agents:

reward = {'agent_2': GLOBAL_REWARD, 'agent_1': GLOBAL_REWARD}
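
As a rough sketch of how the env could build that dict on every step: the helper name step_rewards and the numeric values below are placeholders I made up, not anything from RLlib or from your env.

```python
GLOBAL_REWARD = 10.0  # assumed value, for illustration only
STEP_PENALTY = -1.0   # the per-step local reward from the question

def step_rewards(actor, all_agents, goal_reached):
    """Return the reward dict for one env step (hypothetical helper)."""
    if goal_reached:
        # Terminal step: every agent receives the shared global reward,
        # including agents that did not act on this step.
        return {agent: GLOBAL_REWARD for agent in all_agents}
    # Ordinary step: only the agent that just acted is rewarded.
    return {actor: STEP_PENALTY}
```

On a normal step this yields {"agent_2": -1.0}; on the final step it yields an entry for every agent.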

Hi @rusu24edward, thanks for your reply. Yes, I have control over the env, and I can send the reward you suggested. I did consider this approach before. The problem is that if I use, for example, DQN, then besides the reward I also need to store the action and the obs of agent_1 in the replay buffer.

Let’s assume that at time step K we had, for agent_1: experience={"obs":arr_K, "action": 'left', "reward":-1, "done['__all__']": False}.
And now, at the current time step L, we have for agent_2: experience={"obs":arr_L, "action":'up', "reward":GLOBAL_REWARD, "done['__all__']":True}. The episode terminates here.

Now, if I use the reward you suggested, I need to store another experience for agent_1 as follows:
experience={"obs":arr_K, "action":'left', "reward":GLOBAL_REWARD, "done['__all__']":True}.
But this is not an accurate experience, because agent_1 taking action 'left' in obs arr_K does not by itself deserve GLOBAL_REWARD.

Hi @deepgravity,

I think we’re assuming different things about how RLlib records rewards. Perhaps @sven1977 can clarify this. My understanding has been that, because the reward is a function of the state, action, and next_state, RLlib sums any rewards an agent receives before its next observation. Is this right, Sven?
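
To illustrate the accumulation I have in mind, here is a small plain-Python sketch. It is not RLlib's actual code, and whether RLlib behaves this way is exactly what I am asking; the function build_transitions and the tuple layout are made up for this example. The idea is that an agent's transition stays open until that agent observes again (or the episode ends), and every reward it receives in between is summed into that one transition.

```python
from collections import defaultdict

def build_transitions(steps):
    """Turn turn-taking env steps into per-agent transitions (hypothetical helper).

    steps: ordered list of (actor, obs, action, reward_dict, done_all).
    Returns tuples (agent, obs, action, summed_reward, next_obs, done).
    """
    pending = {}              # agent -> (obs, action) still waiting for its outcome
    acc = defaultdict(float)  # agent -> reward accumulated since its last action
    transitions = []
    for actor, obs, action, reward_dict, done_all in steps:
        if actor in pending:
            # The actor observes again, so its previous transition can be closed.
            prev_obs, prev_action = pending.pop(actor)
            transitions.append(
                (actor, prev_obs, prev_action, acc.pop(actor, 0.0), obs, False)
            )
        pending[actor] = (obs, action)
        # Rewards may target any agent, not only the one that just acted.
        for agent, r in reward_dict.items():
            acc[agent] += r
        if done_all:
            # Episode over: close every open transition as terminal.
            for agent, (o, a) in pending.items():
                transitions.append((agent, o, a, acc.pop(agent, 0.0), None, True))
            pending.clear()
    return transitions
```

With the two steps from the example above and GLOBAL_REWARD = 10, agent_1 ends up with a single terminal transition (arr_K, 'left', reward -1 + 10 = 9) rather than a second, separate experience carrying only GLOBAL_REWARD.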

Hi @mannyv,
Would you please have a look at this question? Thanks!