Policy mapping for computing actions in a multi-agent env

Is there a way to map policies for computing actions according to agent ids in a multi-agent env? For training I can provide a policy mapping function, but for compute_actions it seems I can only provide a single policy id, which is not really helpful when I have different agents using different policies in a single step. What is the best practice here?
Thanks in advance!

Hi @Blubberblub,

I may be misunderstanding, but if you have a trainer with a multi-agent configuration, you should be able to pass in a dictionary mapping agent_ids to observations, and the trainer will take care of using the policy_mapping_fn to route each agent to its correct policy.

Something like this:

trainer = PPOTrainer(...)
obs = env.reset() # {"agent1":obs1, "agent2":obs2,...}
actions, state = trainer.compute_actions(obs)
...
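
If that works, the returned actions are keyed by agent id and can go straight back into the env, roughly like this (just a sketch, assuming the dict-in/dict-out behavior described above):

# the returned actions dict is keyed by agent id, e.g. {"agent1": act1, "agent2": act2, ...}
obs, rewards, dones, infos = env.step(actions)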

@mannyv Thanks for looking at the problem. I tried it like that, but when I call any of the action-computing functions I get a key error: "KeyError: 'default_policy'". So I assume I have to provide the correct policy names.

Looking at the code in trainer.py, I saw that all the action-computing functions have an argument called policy_id that defaults to DEFAULT_POLICY_ID. The functions load the policy with policy = self.get_policy(policy_id).

It is mentioned that you can call compute_actions on the policy directly, but that would mean computing actions for multiple policies per step and then piecing them back together to step the env, so I was looking for a better solution.
I think this case is already handled in RLlib, but I am struggling to get the code right.
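
For illustration, that per-policy approach would look roughly like this with the two policies from my config (if I read the code right, Policy.compute_single_action returns a tuple of (action, state_outs, extra_fetches), so only the first element is kept):

decider_policy = trainer.get_policy("pot_decider_policy")
move_policy = trainer.get_policy("pot_move_policy")

actions = {}
for agent_id, agent_obs in obs.items():
    # map each agent id to its policy by the same prefix convention used in training
    policy = decider_policy if agent_id.startswith("p_d_") else move_policy
    actions[agent_id] = policy.compute_single_action(agent_obs)[0]

obs, reward, done, info = env.step(actions)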

My code for computing actions:

agent = PPOTrainer(config=config, env=MyEnv)
check_point_path = "my_path/checkpoint-302"
agent.restore(check_point_path)

env_config = {"high_lvl_max_count":10,"low_lvl_max_count":10}
env = MyEnv(env_config)

# run until episode ends
episode_reward = 0
done = False

obs = env.reset()
while not done:
	action = agent.compute_single_action(obs)
	obs, reward, done, info = env.step(action)
	episode_reward += reward
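
If I read trainer.py correctly, without a policy_id the call above falls back to the default, so it is effectively doing this, which is what triggers the KeyError:

action = agent.compute_single_action(obs, policy_id="default_policy")  # -> KeyError: 'default_policy'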

More context about my environment:
The env is a hierarchical multi-agent environment. I use a policy mapping function for training like this:

def policy_mapping_fn(agent_id, episode, **kwargs):
	if agent_id.startswith("p_d_"):
		return "pot_decider_policy"
	elif agent_id.startswith("p_m_"):
		return "pot_move_policy"

So here is some working code that illustrates the workaround I would have to do to map policies to agent ids correctly.

agent = PPOTrainer(config=config, env=MyEnv)
check_point_path = "/my_path/checkpoint-302"
agent.restore(check_point_path)

env_config = {"high_lvl_max_count":10,"low_lvl_max_count":10}
env = MyEnv(env_config)


done = {"__all__":False}
obs = env.reset()

# run until episode ends
while not done["__all__"]:
	action = {}
	for key, state in obs.items():

		policy_id = ""
		
		if key.startswith("p_d_"):
			policy_id = "pot_decider_policy"
		elif key.startswith("p_m_"):
			policy_id = "pot_move_policy"


		action[key] = agent.compute_single_action(state, policy_id=policy_id)

	obs, reward, done, info = env.step(action)

	# remove agents that are done from the obs dict
	for key,is_done in done.items():
		if is_done and key!="__all__":
			del obs[key]
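
The prefix checks above just duplicate the training-time mapping, so the loop body could also reuse the policy_mapping_fn from above directly:

policy_id = policy_mapping_fn(key, None)  # no episode object available outside of sampling, so pass None
action[key] = agent.compute_single_action(state, policy_id=policy_id)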

From the code in the compute_actions function it seems to me that the multiple-policy case, which is supported in training via the policy_mapping_fn, is not supported in these action-computing functions. Maybe @sven1977 can clarify and point to where this is handled in training?

Edit: I added some lines that remove done agents.

Hi @Blubberblub,

Can you share what the config looks like and an example of what the obs from reset / step looks like?

The error you report is one I usually see when either the config does not have multiagent configuration information or the observation is not provided in a dictionary.

You said you have both of those, so I am unsure why it would be trying to look up the default policy.

Implementation detail: internally, RLlib treats all environments like multi-agent environments. If you provide an environment that is not multi-agent, it will automatically wrap it as a multi-agent environment with one agent that uses a policy called "default_policy".
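
If you want to double-check which policy ids your trainer actually has, something like this should work (rough sketch, the exact attributes may differ a bit between versions):

for policy_id in trainer.workers.local_worker().policy_map.keys():
    print(policy_id)  # in your setup this should print pot_decider_policy and pot_move_policy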


Thanks again @mannyv for looking. This is my config. I basically pieced it together from some examples for now to get things up and running first, so don't judge. :grinning_face_with_smiling_eyes:

from gym.spaces import Box, Discrete
import numpy as np

# conv_filters entries are [num output channels, kernel size, stride]
filters = [[16, [4, 4], 2], [32, [4, 4], 2], [256, [11, 11], 1]]

config = {
	"env": MyEnv,
	"env_config": {"high_lvl_max_count":10,"low_lvl_max_count":10},
	"entropy_coeff": 0.01,
	"multiagent": {
		"policies": {
			"pot_decider_policy": (None,
								Box(low=0,
									high=255,
									shape=(41, 41, 1),	# shape=(HEIGHT, WIDTH, N_CHANNELS)
									dtype=np.uint32
								), 
								Discrete(2),
								{
									"gamma": 0.9,
									"model":{
										#"no_final_linear" : True,
										"conv_filters":filters
									}
								}),
			"pot_move_policy": (None,
								Box(low=0,
									high=255,
									shape=(41, 41, 1),	# shape=(HEIGHT, WIDTH, N_CHANNELS)
									dtype=np.uint32
								),
								Discrete(4),
								{
									"gamma": 0.0,
									"model":{
										#"no_final_linear" : True,
										"conv_filters":filters
									}
								}),
		},
		"policy_mapping_fn": policy_mapping_fn,
		# Optional list of policies to train, or None for all policies.
		"policies_to_train": None,
	},
	"framework": "tf2",
	"num_gpus": 1.0,
	"sgd_minibatch_size": 500,
	"train_batch_size": 5000,
	"num_workers": 6,
	"num_cpus_per_worker": 0.5,
	"num_envs_per_worker":10,
	"log_level": "INFO",
}

My reset returns these observations (shortened to the first and last dict item for easy viewing). Each observation is basically a grayscale image of shape (41, 41, 1); the last dimension is there so more channels can be added down the road. The environment alternates between high-level steps (decision to move or do nothing) and multiple low-level steps (currently only of one kind: move up, down, right, left).

Sample return from env.reset()

{'p_d_0_0': array([[[0], [0], [0], ..., [0], [0], [0]],
        ...,
        [[0], [0], [0], ..., [0], [0], [0]]], dtype=uint32),

 ...

 'p_d_17_0': array([[[0], [0], [0], ..., [1], [1], [1]],
        ...,
        [[1], [1], [1], ..., [1], [1], [1]]], dtype=uint32)}

Sample obs for a high-level step (equivalent to the return from reset)

{'p_d_0_10': array([[[1], [1], [1], ..., [1], [1], [1]],
        ...,
        [[1], [1], [1], ..., [0], [0], [0]]], dtype=uint32),

 ...

 'p_d_17_10': array([[[0], [0], [0], ..., [1], [1], [1]],
        ...,
        [[1], [1], [1], ..., [1], [1], [1]]], dtype=uint32)}

Sample obs for a low-level step

{'p_m_0_10': array([[[1], [1], [1], ..., [1], [1], [1]],
        ...,
        [[1], [1], [1], ..., [0], [0], [0]]], dtype=uint32),

 ...

 'p_m_17_10': array([[[0], [0], [0], ..., [1], [1], [1]],
        ...,
        [[1], [1], [1], ..., [1], [1], [1]]], dtype=uint32)}

I'm a little late to the discussion, but in case @Blubberblub still has some issues, here's a loop that I ripped out of rollout and use for post-processing the policies. I believe this loop mimics train's internal logic:

policy_agent_mapping = trainer.config['multiagent']['policy_mapping_fn']
for episode in range(100):
    print('Episode: {}'.format(episode))
    obs = sim.reset()
    done = {agent: False for agent in obs}
    while True: # Run until the episode ends
        # Get actions from policies
        joint_action = {}
        for agent_id, agent_obs in obs.items():
            if done[agent_id]: continue # Don't get actions for done agents
            policy_id = policy_agent_mapping(agent_id)
            action = trainer.compute_action(agent_obs, policy_id=policy_id)
            joint_action[agent_id] = action
        # Step the simulation
        obs, reward, done, info = sim.step(joint_action)
        if done['__all__']:
            break
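
One thing to watch: the policy_mapping_fn you defined earlier takes (agent_id, episode, **kwargs), so with that exact signature the lookup above would presumably need an extra argument, e.g.:

policy_id = policy_agent_mapping(agent_id, None)  # or give episode a default value in your mapping fn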

The config you have shown here looks good to me; not sure why it's not running. Perhaps you can provide the whole file in one piece?


Thanks a lot @rusu24edward for sharing this code. It looks like a better version of what I was doing with mine. Sorry, I can't share the complete project here since my environment has a lot of dependencies and contains business logic for a prototype I'm trying to build. Thanks for having a look at the snippets though!

Hi @Blubberblub,

I do not see an issue with what you posted. I can think of two possibilities that may be affecting you but they are just guesses.

  1. There is some condition in your environment that causes it to return an array instead of a multi-agent dictionary. This could be for any of obs, reward, done, or info.

  2. You are returning info for an agent that has been marked done.

If you have more of the stack trace to share, I might be able to help. Feel free to post it here or PM me.
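
If it helps to narrow things down, here are a couple of asserts you could drop right after the sim.step(...) call in the loop above (rough checks matching the two guesses; if I remember correctly, RLlib also expects the info keys to be a subset of the obs keys):

assert isinstance(obs, dict), "obs must be a per-agent dict, not a plain array"
assert isinstance(done, dict) and "__all__" in done, "done must be a dict that includes '__all__'"
assert set(info).issubset(set(obs)), "info returned for an agent that has no obs this step (e.g. already done)"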