How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hi, I’m working with a turn-based game environment. I’ve successfully trained a self-play agent using a single-agent environment: on each step, the env alternates between returning the observation for player_0 and player_1.
Now I’m trying to replicate the same self-play results with a MultiAgentEnv, but it looks like I’m missing some critical difference in how the multi-agent case works. I’ve extended the env so that the step function now returns the observation nested under the “player_0” or “player_1” key, depending on whose turn it is:
{
    "player_0": <{obs dict from before}>
}
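Heavily simplified, the env now looks something like the sketch below (the observation/action spaces, the 10-turn cutoff, and the zero rewards are placeholders standing in for my real game logic; only the turn-alternating structure matters):

import gym
import numpy as np
from ray.rllib.env.multi_agent_env import MultiAgentEnv

class MySelfPlayEnv(MultiAgentEnv):
    """Simplified stand-in for my real turn-based game env."""

    def __init__(self, env_config=None):
        super().__init__()
        # Placeholder spaces -- the real ones come from the game.
        self.observation_space = gym.spaces.Box(-1.0, 1.0, shape=(4,), dtype=np.float32)
        self.action_space = gym.spaces.Discrete(2)
        self.current_player = "player_0"
        self.turn = 0

    def reset(self):
        self.current_player = "player_0"
        self.turn = 0
        # Only the player whose turn it is receives an observation.
        return {self.current_player: self.observation_space.sample()}

    def step(self, action_dict):
        # action_dict only ever contains the acting player's action.
        action = action_dict[self.current_player]  # consumed by the real game logic
        self.turn += 1
        done = self.turn >= 10                     # placeholder termination
        rewards = {self.current_player: 0.0}       # placeholder reward
        # Alternate turns: the *other* player gets the next observation.
        self.current_player = (
            "player_1" if self.current_player == "player_0" else "player_0"
        )
        obs = {self.current_player: self.observation_space.sample()}
        return obs, rewards, {"__all__": done}, {}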
I’ve also added the multiagent key to my training script:
import ray
from ray.rllib.agents.ppo import PPOTrainer
from ray.rllib.policy.policy import PolicySpec

ray.init(local_mode=False)

config = {
    "num_workers": 0,
    "num_gpus": 1,
    "num_envs_per_worker": 128,
    "framework": "torch",
    "disable_env_checking": True,
    "_disable_preprocessor_api": True,
    "_disable_action_flattening": True,
    "env": "my_selfplay_env",  # my custom MultiAgentEnv, registered under this name elsewhere
    "model": {
        "custom_model": MyModel,             # my custom model class
        "custom_action_dist": MyActionDist,  # my custom action distribution
    },
    "multiagent": {
        # A single shared policy; both players map to it.
        "policies": {
            "main": PolicySpec(),
        },
        "policy_mapping_fn": lambda agent_id, episode, worker, **kwargs: "main",
        "policies_to_train": ["main"],
    },
}

trainer = PPOTrainer(config)
...
I would expect this to do exactly the same thing as the self-play setup I had before, but with the new setup my results have higher variance and lower rewards, across multiple seeds. This makes me think I’ve misconfigured something so that the policy is being optimized differently in this new setup (e.g., is this actually training one set of parameters shared by both players, like I’d expect?). What am I missing?
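For reference, by “one set of parameters for both players” I mean something like the following hypothetical check (just to illustrate the expectation; mapping_fn and main_weights are names I’m using here, not from my actual script):

# Both agent ids should map to the same "main" policy ...
mapping_fn = config["multiagent"]["policy_mapping_fn"]
assert mapping_fn("player_0", None, None) == mapping_fn("player_1", None, None) == "main"

# ... so these should be the only parameters PPO is updating.
main_weights = trainer.get_policy("main").get_weights()
print({name: w.shape for name, w in main_weights.items()})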