Evaluating multiple policies in a multiagent environment

Hi all,

If I understand correctly, the supported way to evaluate the performance of a trained agent is to use the rollout function.
I have trained several policies that use different algorithms, e.g. one policy with PPO and one with DDPG.
Is there any way to have these different policies play against each other in the same multiagent environment?

Hey @PavelC , great question!
For pure evaluation, you could actually use any Trainer (b/c you don’t care about how training updates are done) and set it up as “multiagent”, like in this example here:

ray.rllib.examples.multi_agent_custom_policy.py

Then call .evaluate() on your Trainer instance. Provided you have specified the correct agent ID -> policy ID mapping function, this should have both policies play against each other in your env.
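For instance, the multiagent setup for such a pure-evaluation Trainer could look roughly like this (a minimal sketch; the env name, spaces, and policy names are placeholders, not taken from the linked example):

    # Minimal sketch: one Trainer holding two policies, each agent mapped to "its" policy.
    # "my_multi_agent_env", the spaces, and the policy names are placeholders.
    from gym.spaces import Box, Discrete
    from ray.rllib.agents.ppo import PPOTrainer

    obs_space = Box(-1.0, 1.0, (4,))
    act_space = Discrete(2)

    config = {
        "env": "my_multi_agent_env",
        "multiagent": {
            "policies": {
                # (policy_cls or None for the Trainer's default, obs_space, act_space, config overrides)
                "policy_0": (None, obs_space, act_space, {}),
                "policy_1": (None, obs_space, act_space, {}),
            },
            # Route agent IDs 0 and 1 to policy_0 / policy_1.
            "policy_mapping_fn": lambda agent_id: f"policy_{agent_id}",
        },
    }

    trainer = PPOTrainer(config=config)
    results = trainer.evaluate()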

Thank you for taking a look, Sven, this helps a lot!

Now I was not 100% sure how to get the different stored policies into the same trainer. Here is how I would do it; does that seem like generally the right way to go? This would be for num_agents = len(checkpoint_paths) different trained policies:

    # Load original weights by way of trainers
    restored_trainers = []  # One restored trainer per trained policy
    for i in range(num_agents):
        trainer = trainer_classes[i](config=config, env=env)
        trainer.restore(checkpoint_paths[i])
        restored_trainers.append(trainer)

    # Set up the config for the eval trainer
    eval_config = ... # initialize config similar to training

    # Set policies according to loaded agents
    for i, trainer in enumerate(restored_trainers):
        eval_config['multiagent']['policies'][f'policy_{i}'] = (
            type(trainer.get_policy(f'policy_{i}')),
            env.observation_space_dict[i],
            env.action_space_dict[i],
            {'agent_id': i}
        )

    eval_config['evaluation_num_episodes'] = 100
    # Create the trainer to perform eval in
    eval_trainer = PPOTrainer(config=eval_config, env=env)

    # Set restored weights
    for i, trainer in enumerate(restored_trainers):
        # Copy the restored weights for policy_i into the eval trainer
        # (get_weights returns a dict {policy_id: weights}, which set_weights accepts)
        eval_trainer.set_weights(trainer.get_weights([f'policy_{i}']))

    # Perform eval
    results = eval_trainer.evaluate()

To summarize, the idea is: (1) load the original checkpoints with trainer.restore(); (2) set the correct policy class, e.g. PPOTFPolicy, at config.multiagent.policies.policy_i[0] in the config used for the evaluation trainer; (3) create any trainer with this multiagent config; (4) set the weights from the restored trainers.
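One piece not shown in the snippet above is the policy mapping; the eval config also needs a policy_mapping_fn so that agent i is actually routed to policy_i. A rough sketch of what I have in mind (assuming the env uses integer agent IDs):

    # Assumed: agent IDs in the env are integers 0..num_agents-1.
    eval_config['multiagent']['policy_mapping_fn'] = lambda agent_id: f'policy_{agent_id}'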

This seems to work, but one thing is still behaving oddly: I want to set the number of episodes (or alternatively timesteps) used for evaluation, so I set eval_config['evaluation_num_episodes'] = 100. However, when I look at results['episodes_this_iter'] after evaluate(), the value seems mostly unrelated to that config setting. Am I missing something here?

Hi @PavelC,

I think results['episodes_this_iter'] is returning a count of training episodes. There should be a top level key results["evaluation"] that holds a dictionary with evaluation metrics.
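Something like this should get at the numbers you are after (the exact keys can vary by RLlib version):

    # Evaluation metrics are nested under results["evaluation"]:
    eval_metrics = results["evaluation"]
    print(eval_metrics["episodes_this_iter"])
    print(eval_metrics["hist_stats"]["episode_reward"])  # per-episode rewards of the eval run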

Yes, sorry @mannyv, you're right, I meant results['evaluation']['episodes_this_iter'], which seems to be 50x - 100x what I put for 'evaluation_num_episodes' in the config. Likewise, the lists in 'hist_stats', e.g. results['evaluation']['hist_stats']['episode_reward'], have 'episodes_this_iter' many elements.