I have built a multi-agent env that is organized hierarchically.
Example action dict for level-0 actions for 3 agents (the agent_id is built as "type_of_agent" + "_" + "lvl_of_agent" + "_" + "agent_number"):
action_dict = {"A_0_0": 2, "A_0_1":4, "A_0_2":3}
I transition to another level by changing the "lvl_of_agent" part of the agent ids in the obs, rew and done dicts returned after a completed step. The policy mapping function decides which policy to use from the first two parts of the agent id (e.g. A_0_* maps to policy 1, whereas A_1_* maps to policy 2). I adapted this idea from the hierarchical windy maze example.
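This is roughly what I mean by the mapping (a simplified sketch; the policy names "policy_0"/"policy_1" are placeholders, and the exact signature of policy_mapping_fn depends on the RLlib version):

def policy_mapping_fn(agent_id):
    # split "A_1_2" into type "A", level "1", number "2"
    agent_type, level, _number = agent_id.split("_")
    # all agents of one level share a policy: "A_0_*" -> "policy_0", "A_1_*" -> "policy_1"
    return f"policy_{level}"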
Example: returns from level 0 that transition to level 1 (in the next step) for 3 agents:
obs = {"A_1_0": 5, "A_1_1":2, "A_1_2":0}
rew = {"A_1_0": 0, "A_1_1":0, "A_1_2":0}
done = {"A_1_0": false, "A_1_1":false, "A_1_2":false}
So my question is: Do the agent ids in the reward dict have to be coupled to the ones in the observation dict, or can I return rewards to the policies independently of the observations?
In non-hierarchical environments this doesn't pose a problem, since the returns always go to the same policy or set of policies. The problem for me is that when I switch down a level, I can't report rewards yet (because nothing has happened on that level). It is also a problem for the lower levels: if I don't take multiple steps on a level, I can't provide rewards to its policy, since the return is already intended for the next level. The only solution I see would be to store the rewards intermediately and return them with a level switch, as sketched below.
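The buffering idea would look roughly like this (very simplified; helpers like self._compute_reward() and self._level_finished() are placeholders):

def step(self, action_dict):
    # accumulate rewards under the agent ids that earned them
    for agent_id, action in action_dict.items():
        self.reward_buffer[agent_id] = (
            self.reward_buffer.get(agent_id, 0.0)
            + self._compute_reward(agent_id, action)
        )
    rew = {}
    if self._level_finished():
        # flush the buffered rewards only together with the level switch
        rew = dict(self.reward_buffer)
        self.reward_buffer.clear()
    # ... build obs/done with the agent ids of the level that acts next ...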
I guess this could be an issue for trajectory generation, but I would like to know how Ray handles this and whether it is even possible when using, for example, DQN or PPO. For illustration purposes I created an image of the control flow in my environment.
Thanks for any help in advance!