Agent_key and policy_id mismatch in multi-agent ensemble training

Hi, I’m trying to do multi-agent ensemble training, where each agent has several policies and randomly picks one policy to execute in each episode. Here is the code snippet I used to implement this logic:

import importlib
import random
from ray.rllib.env import PettingZooEnv
from ray.rllib.agents.ppo import PPOTorchPolicy

def env_creator():
    env_id = 'pettingzoo.mpe.simple_crypto_v2'
    return importlib.import_module(env_id).env()

test_env = PettingZooEnv(env_creator())
agents_id = test_env.agents

main_policies = {}
for i, agent_id in enumerate(agents_id):
    for j in range(ensemble_size):
        main_policies[f'{agent_id}_{j}'] = (PPOTorchPolicy,
                                            obs_space,
                                            act_space,
                                            {"framework": "torch"})

main_config["multiagent"] = {
    "policies": main_policies,
    "policy_mapping_fn": lambda agent_id: f'{agent_id}_{random.randint(0, ensemble_size - 1)}',
    "policies_to_train": list(main_policies.keys()),
}

Training with these settings works for most of the training steps, but I randomly get an assertion error from ray/rllib/evaluation/collectors/simple_list_collector.py, line 481, in add_init_obs:
assert self.agent_key_to_policy_id[agent_key] == policy_id

I’m using rllib 1.2.0 and pettingzoo 1.4.2+. It takes quite some time to run the training to reproduce this error. I would appreciate any insight into what may be causing this problem, and please also verify that I specified the training config correctly. Thanks.

Hey @MachengShen, thanks for filing this. This could possibly be a bug in RLlib. I’ll try to run your example and see whether I can reproduce it.

Getting this error when trying to reproduce:

Traceback (most recent call last):
  File "/Applications/PyCharm CE.app/Contents/plugins/python-ce/helpers/pydev/pydevd.py", line 1477, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/Applications/PyCharm CE.app/Contents/plugins/python-ce/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/Users/sven/Dropbox/Projects/ray/rllib/examples/issues/discuss_995.py", line 17, in <module>
    test_env = PettingZooEnv(env_creator())
  File "/Users/sven/opt/anaconda3/envs/ray/lib/python3.7/site-packages/ray/rllib/env/wrappers/pettingzoo_env.py", line 87, in __init__
    "Observation spaces for all agents must be identical. Perhaps " \
AssertionError: Observation spaces for all agents must be identical. Perhaps SuperSuit's pad_observations wrapper can help (useage: `supersuit.aec_wrappers.pad_observations(env)`
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/Users/sven/opt/anaconda3/envs/ray/lib/python3.7/site-packages/ray/node.py", line 931, in _kill_process_type
    wait=wait)
  File "/Users/sven/opt/anaconda3/envs/ray/lib/python3.7/site-packages/ray/node.py", line 983, in _kill_process_impl
    process.wait(timeout_seconds)
  File "/Users/sven/opt/anaconda3/envs/ray/lib/python3.7/subprocess.py", line 1019, in wait
    return self._wait(timeout=timeout)
  File "/Users/sven/opt/anaconda3/envs/ray/lib/python3.7/subprocess.py", line 1647, in _wait
    time.sleep(delay)
KeyboardInterrupt

Could you provide a complete, self-contained script for me to debug? I had to fill in several “blanks” (like obs_space, action_space, config, etc.) to make the above snippet work, which adds sources of error and reproducibility issues. Thanks.

Thanks for the prompt reply, and sorry for the inconvenience. Here is a minimal working script I extracted from my code. I ran this script and got the “assert self.agent_key_to_policy_id[agent_key] == policy_id” error at step 3361. I believe the number of steps needed to hit the error is random, but it probably takes on the order of thousands of steps to reproduce.

With rllib==1.2.0, pettingzoo==1.4.2, supersuit==2.3.0:

import importlib
import random
import copy
from numpy import float32
import ray
from ray.tune.registry import register_env
from ray.rllib.env import PettingZooEnv
from ray.rllib.agents.ppo import PPOTorchPolicy
from ray.rllib.agents.registry import get_agent_class
from supersuit import (dtype_v0,
                       pad_observations_v0)

def env_creator():
    env_id = 'pettingzoo.mpe.simple_crypto_v2'
    env = importlib.import_module(env_id).env()
    env = dtype_v0(env, dtype=float32)  # cast observations to float32
    env = pad_observations_v0(env)      # make all agents' obs spaces identical
    return env

alg_name = 'PPO'
ENSEMBLE_SIZE = 3
test_env = PettingZooEnv(env_creator())
obs_space = test_env.observation_space
act_space = test_env.action_space
agents_id = test_env.agents

config = copy.deepcopy(get_agent_class(alg_name)._default_config)
config["framework"] = "torch"
config["log_level"] = "INFO"
config["num_workers"] = 2
config["num_cpus_per_worker"] = 1
config["rollout_fragment_length"] = 100
config["train_batch_size"] = 2000
config["sgd_minibatch_size"] = 256
config["entropy_coeff"] = 0.01
config["lambda"] = 0.9
config["vf_clip_param"] = 50
config["num_sgd_iter"] = 15

main_policies = {}
for i, agent_id in enumerate(agents_id):
    for j in range(ENSEMBLE_SIZE):
        main_policies[f'{agent_id}_{j}'] = (PPOTorchPolicy,
                                            obs_space,
                                            act_space,
                                            {"framework": "torch"})

config["multiagent"] = {
            "policies": main_policies,
            "policy_mapping_fn": lambda agent_id: f'{agent_id}_{random.randint(0, ENSEMBLE_SIZE - 1)}',
            "policies_to_train": list(main_policies.keys()),
        }

ray.init(num_cpus=2)
register_env('simple_crypto',
             lambda config: PettingZooEnv(env_creator()))
trainer_class = get_agent_class(alg_name)
main_trainer = trainer_class(env='simple_crypto',
                             config=config)
for i in range(200000):
    print(f'step {i}')
    main_trainer.train()

I am wondering whether you were able to reproduce the error with the updated script. Please let me know if I can help, thanks!

Checking again right now. Thanks so much for the repro script! It’s working fine. Trying to get it to break.

Running for 5000+ iters now and it hasn’t broken yet.
One thought, though: the assertion would only fail if your agent-ID-to-policy-ID mapping function returned a different policy ID for the same agent ID on two different calls. This is not supported. The same agent ID must always map to the same policy ID.
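For illustration, a mapping that satisfies this constraint is a pure function of the agent ID. A minimal sketch, reusing ENSEMBLE_SIZE and the policy naming from your snippet (crc32 is just one way to make the choice deterministic across calls and across worker processes):

import zlib

# Deterministic agent_id -> policy_id mapping: the same agent ID always
# yields the same policy ID, on every call and in every rollout worker,
# because the choice is a pure function of the agent ID.
def stable_policy_mapping_fn(agent_id):
    member = zlib.crc32(agent_id.encode()) % ENSEMBLE_SIZE
    return f'{agent_id}_{member}'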
Could you double-check whether your policy_mapping_fn could possibly do this?

Thanks for the comment, but I thought mapping the same agent_id to different policies was supported, and that is exactly what I intend in my case: I randomly map the same agent_id, with equal probability (1/3), to one of three different policies for ensemble training. If this were not supported, I would be surprised that the assertion error is only triggered after running for thousands of steps.

Is there any workaround to make training work even when the same agent_id maps to different policies?
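For example, would something along these lines be a valid workaround? (An untested sketch: it re-draws the ensemble member once per episode via a callback; AGENT_IDS is a stand-in for the env’s agent ID list, and it assumes at most one episode is in flight per worker.)

import random
from ray.rllib.agents.callbacks import DefaultCallbacks

# One shared choice table per rollout-worker process (untested sketch;
# assumes at most one episode is in flight per worker at a time).
EPISODE_CHOICES = {}

class ResampleEnsembleCallbacks(DefaultCallbacks):
    def on_episode_start(self, *, worker, base_env, policies, episode,
                         **kwargs):
        # Re-draw one ensemble member per agent at episode start, so the
        # mapping stays fixed for the rest of the episode.
        for agent_id in AGENT_IDS:  # hypothetical: the env's agent ID list
            EPISODE_CHOICES[agent_id] = random.randint(0, ENSEMBLE_SIZE - 1)

def per_episode_policy_mapping_fn(agent_id):
    # Fall back to member 0 if the callback has not fired yet.
    return f'{agent_id}_{EPISODE_CHOICES.get(agent_id, 0)}'

# config["callbacks"] = ResampleEnsembleCallbacks
# config["multiagent"]["policy_mapping_fn"] = per_episode_policy_mapping_fn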

I’m trying again to break it now.
You were right about the agent -> policy mapping: one agent mapping to more than one policy is supported. We already do this in, e.g., our multi-agent CartPole example script.
Also, we can probably get rid of the assert at this point in the code (add_init_obs), since it is only called at the beginning of an episode. Throughout the episode, the mapping will not be updated anyway, so it should all be fine.
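Once that assert is gone, and in later RLlib versions where the policy_mapping_fn is also passed the episode, you could even make the random draw explicitly episode-stable. A hedged sketch (ENSEMBLE_SIZE as in your script):

import random

# Sketch for later RLlib versions, where the mapping fn also receives the
# episode: seeding an RNG with (episode_id, agent_id) keeps the choice fixed
# within one episode while re-randomizing it across episodes. hash() is
# process-salted, which is fine here since an episode lives in one worker.
def policy_mapping_fn(agent_id, episode, **kwargs):
    rng = random.Random(hash((episode.episode_id, agent_id)))
    return f'{agent_id}_{rng.randint(0, ENSEMBLE_SIZE - 1)}'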

PR: [RLlib] Issue: Agent_id -> Policy_id mapping should not need to be fixed between episodes. by sven1977 · Pull Request #15020 · ray-project/ray · GitHub

Let’s see what the test cases say.