Evaluating multi-agent policies trained with self-play

Hi all, I’ve set up training between two agents using league-based self-play, and I want to evaluate the main policies against one another during evaluation. At each training iteration, new agents get added to the opponent pool and the policy mapping function is updated. However, I’m running into an issue where the updated policy mapping function is used for evaluation instead of the one declared in the config.

Here’s the evaluation config with its policy_mapping_fn:

config["evaluation_config"] = {
    "multiagent": {
        "policy_mapping_fn": lambda agent_id: agent_id
    }
}

and in the callbacks:

def on_train_result(self, *, trainer, result, **kwargs):
    ...
    def policy_mapping_fn(agent_id, episode, worker, **kwargs):
        if (episode.episode_id % 2) == 0:
            if agent_id == "attacker":
                agents = list(range(0, self.opponents[agent_id] + 1))
                agent_selection = self.rng.choice(agents).item()
                return f"{agent_id}_v{agent_selection}"
            elif agent_id == "defender":
                return "defender"
        else:
            if agent_id == "attacker":
                return "attacker"
            elif agent_id == "defender":
                agents = list(range(0, self.opponents[agent_id] + 1))
                agent_selection = self.rng.choice(agents).item()
                return f"{agent_id}_v{agent_selection}"

    new_policy = trainer.add_policy(
        policy_id=new_pol_id,
        policy_cls=type(trainer.get_policy(agent)),
        config=config,
        policy_mapping_fn=policy_mapping_fn,
        action_space=trainer.get_policy(agent).action_space
    )

Is there a way to check whether an episode is running in evaluation, so I can map the policies correctly? Any other ideas would be appreciated.
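One approach I’ve seen is to branch inside the mapping function on the worker itself: RLlib’s evaluation workers are, as far as I can tell, constructed with "in_evaluation": True in their policy config, so the mapping fn can inspect worker.policy_config to detect evaluation episodes. Below is a minimal sketch of that idea (the league branch is simplified to a fixed "_v0" opponent just to make it self-contained; whether in_evaluation is set may depend on your RLlib version):

```python
from types import SimpleNamespace

def policy_mapping_fn(agent_id, episode, worker, **kwargs):
    # Assumption: evaluation workers carry "in_evaluation": True in their
    # policy_config. On those workers, always map agents to their main policies.
    if worker.policy_config.get("in_evaluation", False):
        return agent_id  # "attacker" -> "attacker", "defender" -> "defender"
    # Training workers keep the league-based mapping (simplified here:
    # the even/odd episode split from the snippet above, with a fixed v0).
    if (episode.episode_id % 2) == 0:
        return "defender" if agent_id == "defender" else f"{agent_id}_v0"
    return "attacker" if agent_id == "attacker" else f"{agent_id}_v0"

# Stand-ins for RLlib's worker/episode objects, just to exercise the branching:
eval_worker = SimpleNamespace(policy_config={"in_evaluation": True})
train_worker = SimpleNamespace(policy_config={})
episode = SimpleNamespace(episode_id=2)

print(policy_mapping_fn("attacker", episode, eval_worker))   # attacker
print(policy_mapping_fn("attacker", episode, train_worker))  # attacker_v0
```

With this shape you can pass the same function to add_policy and still get main-vs-main matchups during evaluation, without relying on the evaluation_config override being picked up.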

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Just so I understand correctly: your issue is that you have a self-play setup where you add new policies to your league, but evaluation then ends up running against all of the added policies, not just the initial policies in your league, right?

Right, I want to evaluate only the initial policies (the trainable ones), not the additional policies that are added to the league.