Hi all,
I developed a cooperative multi-agent env in which each agent has its own local goal, but an episode terminates only when the global goal is achieved. It is a turn-taking env: the agents do not act simultaneously but consecutively.
I am using RLlib, but I do not know how to share the global reward between agents in this setting.
This is how my env works in RLlib:
For simplicity, let’s assume I have only two agents.
First, `agent_1` takes an action: I send `action={"agent_1": 'left'}` to the env. The env then returns `obs={"agent_2": some_arr}`, `done={"__all__": False}`, and `reward={"agent_1": -1}`. The obs dict contains only the key for `agent_2`, because it is `agent_2`'s turn next.
Similarly, `agent_2` takes an action: I send `action={"agent_2": 'up'}` to the env. The env then returns `obs={"agent_1": some_new_arr}`, `done={"__all__": False}`, and `reward={"agent_2": -1}`. The obs dict contains only the key for `agent_1`, because it is `agent_1`'s turn next.
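For concreteness, here is a minimal sketch of this turn-taking protocol. It is plain Python standing in for a real `ray.rllib.env.MultiAgentEnv` subclass (which I have omitted so the snippet is self-contained); the observation arrays and the `-1` local reward are placeholders:

```python
class TurnTakingEnv:
    """Sketch of a two-agent turn-taking env using RLlib-style dicts.

    Only the agent whose turn is next appears in the returned obs dict,
    and only the agent that just acted appears in the reward dict.
    """

    def __init__(self):
        self.agents = ["agent_1", "agent_2"]
        self.turn = 0  # index of the agent whose turn it is

    def reset(self):
        self.turn = 0
        # Only the agent that acts next gets an observation.
        return {self.agents[self.turn]: [0.0]}

    def step(self, action_dict):
        acting = self.agents[self.turn]
        # Local step reward for the agent that just acted.
        reward = {acting: -1}
        done = {"__all__": False}
        # Hand the turn to the other agent; only it appears in obs.
        self.turn = 1 - self.turn
        obs = {self.agents[self.turn]: [0.0]}
        return obs, reward, done, {}
```

So after `reset()` only `agent_1` is observed; after its step, only `agent_2` is observed and only `agent_1` is rewarded, and so on alternately.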
Let’s assume that after some time steps it is again, for example, `agent_2`'s turn. It takes an action, and this time the global objective is satisfied.
So my question is: how do I give the agents the global reward? For `agent_2`, which just acted, the env can easily send `reward={"agent_2": GLOBAL_REWARD}`. But `agent_1` took its action in the previous time step, and the env already sent it its local reward back then. So how can the env send the GLOBAL_REWARD to `agent_1`?
I could not find any similar example in RLlib.
One solution that comes to my mind: although the episode has effectively terminated, I run the env for one more time step, force `agent_1` to take a `no_action` action, and have the env give it the GLOBAL_REWARD. Only then do I really terminate the episode.
But I am not sure whether this solution is sound!
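A sketch of that workaround follows. Everything here is an assumption on my side, not an RLlib recipe: `GLOBAL_REWARD`, the goal check, and the extra "phantom" step are placeholders I made up to illustrate the idea:

```python
GLOBAL_REWARD = 10  # placeholder value


class DeferredRewardEnv:
    """Workaround sketch: when the global goal is met, delay termination
    by one step so the off-turn agent can take a no-op and still collect
    GLOBAL_REWARD in its own reward dict entry.
    """

    def __init__(self):
        self.agents = ["agent_1", "agent_2"]
        self.turn = 0
        self.goal_reached = False

    def reset(self):
        self.turn = 0
        self.goal_reached = False
        return {self.agents[self.turn]: [0.0]}

    def step(self, action_dict):
        acting = self.agents[self.turn]
        if self.goal_reached:
            # Extra phantom step: the off-turn agent was forced to take
            # no_action; pay it the global reward and really terminate.
            return {}, {acting: GLOBAL_REWARD}, {"__all__": True}, {}
        if self._global_goal_satisfied(action_dict[acting]):
            self.goal_reached = True
            # The acting agent gets the global reward now; the other
            # agent gets one more (no-op) step to receive its share.
            self.turn = 1 - self.turn
            other = self.agents[self.turn]
            return {other: [0.0]}, {acting: GLOBAL_REWARD}, {"__all__": False}, {}
        # Ordinary turn: local reward, then pass the turn.
        reward = {acting: -1}
        self.turn = 1 - self.turn
        return {self.agents[self.turn]: [0.0]}, reward, {"__all__": False}, {}

    def _global_goal_satisfied(self, action):
        # Placeholder goal check; the real env has its own condition.
        return action == "finish"
```

My worry is that this injects a fake transition into `agent_1`'s trajectory (a no-op with a large reward), and I do not know how that interacts with RLlib's postprocessing.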
I would be happy if you could share your ideas here.
Thanks!