[MADDPG] using policies_to_train

Hi,
I have some trouble using the policies_to_train setting in the multiagent config with MADDPG.
When I try to train only one of the policies with this setting, I get an exception that 'obs_1' cannot be found.
Here is a minimal example to reproduce this, based on the two_step_game example code for MADDPG:

from gym.spaces import Discrete
import ray
from ray import tune
from ray.rllib.examples.env.two_step_game import TwoStepGame

if __name__ == "__main__":
    config = {
        "env_config": {
            "actions_are_logits": True,
        },
        "multiagent": {
            "policies": {
                "pol1": (None, Discrete(6), TwoStepGame.action_space, {
                    "agent_id": 0,
                    # This fixes the problem
                    # "use_local_critic": True
                }),
                "pol2": (None, Discrete(6), TwoStepGame.action_space, {
                    "agent_id": 1,
                }),
            },
            "policy_mapping_fn": lambda x: "pol1" if x == 0 else "pol2",
            "policies_to_train": ['pol1']  # This causes an exception
        },
        "framework": "tf",
        # No GPU is needed for this minimal repro.
        "num_gpus": 0,
    }
    ray.init(num_cpus=2)
    stop = {
        "episode_reward_mean": 7,
        "timesteps_total": 50000,
        "training_iteration": 200,
    }
    config = dict(config, **{
        "env": TwoStepGame,
    })
    results = tune.run("contrib/MADDPG",  # the registered MADDPG trainer
                       stop=stop, config=config, verbose=1)
    ray.shutdown()

I ran this with Ray 2.0.0.dev0.

I think the problem might be that I am using "use_local_critic": False for both agents. When only one policy is trained, the shared critic still expects the observations of the policy that is not being trained, but because of the policies_to_train setting those observations are never collected.

When I set "use_local_critic": True for the agent that is training, this exception does not occur. However, this does not solve my problem because my use case is actually the following:

  • Train 2 MADDPG agents normally with shared critics and self-play
  • Restore the checkpoint, but now train only one of the agents while freezing the other (see the sketch below)
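To make the second step concrete, here is roughly what I would like to run after the first self-play training phase. The checkpoint path is just a placeholder, and config and stop are the same dicts as in the repro above, except that the first run would train both policies; this snippet is only meant to illustrate the intent:

# Reuse the config from above, but restrict training to pol1 only,
# keeping pol2 frozen at the restored weights.
frozen_config = dict(config, **{
    "multiagent": dict(config["multiagent"], **{
        "policies_to_train": ["pol1"],
    }),
})
results = tune.run(
    "contrib/MADDPG",
    restore="/path/to/self_play_run/checkpoint-200",  # placeholder path
    stop=stop,
    config=frozen_config,
    verbose=1,
)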

Is there any way to do this?
The above behavior seems like a bug.
This issue seems similar, but does not solve my problem.