Inconsistency when configuring selfplay with shared parameters

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi, I’m working with a turn-based game environment. I’ve successfully trained a selfplay agent with a single-agent environment: on each step, the environment alternates between showing the obs for player_0 and player_1.

Now I’m trying to replicate the same selfplay results with a MultiAgentEnv, but it looks like I’m missing some critical difference. I’ve extended the env so that the step function now returns the observation nested under the “player_0” or “player_1” key depending on whose turn it is:

   {"player_0": <obs dict from before>}

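For context, a minimal sketch of that turn-based step pattern (all names are illustrative; the real env details aren’t shown here, and the actual class would subclass ray.rllib.env.MultiAgentEnv):

```python
# Hedged sketch of a turn-based multi-agent env: on each step, only the
# player whose turn it is next appears in the returned obs dict. A plain
# class is used so the sketch stands alone without Ray installed.

class TurnBasedSelfplayEnv:
    def __init__(self):
        self.turn = 0  # 0 -> "player_0", 1 -> "player_1"

    def reset(self):
        self.turn = 0
        return {"player_0": self._obs()}

    def step(self, action_dict):
        acting = f"player_{self.turn}"
        # ... apply action_dict[acting] to the game state here ...
        self.turn = 1 - self.turn
        next_player = f"player_{self.turn}"
        obs = {next_player: self._obs()}   # only the next player observes
        rewards = {acting: 0.0}            # mid-game: no reward yet
        dones = {"__all__": False}
        return obs, rewards, dones, {}

    def _obs(self):
        # placeholder observation dict standing in for the real one
        return {"board": [0] * 9}
```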
and I’ve added the multiagent key to my training script:

config = {
    "num_workers": 0,
    "num_gpus": 1,
    "num_envs_per_worker": 128,
    "framework": "torch",
    "disable_env_checking": True,
    "_disable_preprocessor_api": True,
    "_disable_action_flattening": True,
    "env": "my_selfplay_env",
    "model": {
        "custom_model": MyModel,
        "custom_action_dist": MyActionDist,
    },
    "multiagent": {
        "policies": {
            "main": PolicySpec(),
        },
        "policy_mapping_fn": lambda agent_id, episode, worker, **kwargs: "main",
        "policies_to_train": ["main"],
    },
}
trainer = PPOTrainer(config)

I would expect this to do exactly the same thing as the selfplay setup I had before, but my results are higher variance and the rewards are lower with the new setup, across multiple seeds. This makes me think that I’ve misconfigured something such that the policy is being optimized differently in this new setup (e.g. is this training one set of parameters for both players, as I’d expect?). What am I missing?

Hi @jlin816 ,

Your setup looks fine to me. Could it be that your observations are mixed up, so that each player sees what the other is supposed to see? I gather that your experiment executes fine and actually learns something, just not as well as before, right? Since everything looks good, it’s hard to say from here :frowning: Can you post a comparison of metrics? Asking just out of curiosity.

Hi @arturn,

Thanks so much for the response! I figured out the issue, posting here in case it helps others :slight_smile: I was only giving the agent that acted on the last timestep the final reward, so I think effectively we weren’t learning from half of the actions in every game. Setting the reward at the final timestep to {"player_0": reward, "player_1": reward} reproduced the behavior from selfplay.
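The fix at the terminal step can be sketched roughly like this (a hypothetical helper, not the actual env code):

```python
# Hedged sketch of the terminal-step fix: when the game ends, emit the
# final reward for BOTH players, not just the one who moved last, so the
# shared "main" policy learns from every agent's trajectory.

def terminal_returns(final_reward):
    obs = {}  # no further observations once the episode ends
    # Before the fix, only the last-acting player was rewarded, e.g.
    # {"player_1": final_reward}, so half the actions carried no signal.
    rewards = {"player_0": final_reward, "player_1": final_reward}
    dones = {"__all__": True}
    return obs, rewards, dones, {}
```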

Awesome, glad to hear that!