Board game self-play PPO

Hi, I’ve implemented a multiagent version of Connect 4 and I’m trying to train it with PPO through self-play.
At each turn the environment returns the observation and reward for the player that will move next.
The observations are:

  1. The board configuration from the current player’s point of view (for example, if player 1 sees the bottom row as [0,0,1,0,2,0,0], player 2 will see it as [0,0,2,0,1,0,0]; I’ve done this in order to use self-play).

  2. The action mask, which I use in my custom model to remove invalid actions.

After the last (winning) move, the environment returns observations and rewards for both players: +1 for the winning player and -1 for the losing one. I’ve also randomized the starting order (e.g. player 1 can start as the second player to move) so that player 1 is able to see all possible board configurations.
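For reference, here is a minimal sketch of the kind of player-relative encoding described above. The flat array representation and the piece IDs (1 = piece owner, 2 = opponent) are assumptions for illustration, not the actual implementation:

```python
import numpy as np

def relative_board(board: np.ndarray, player: int) -> np.ndarray:
    """Return the board from `player`'s point of view: the mover's own
    pieces are always encoded as 1 and the opponent's as 2."""
    if player == 1:
        return board.copy()
    # For player 2, swap piece IDs 1 <-> 2 so the network always
    # sees "my pieces" as 1 regardless of which seat it occupies.
    mirrored = board.copy()
    mirrored[board == 1] = 2
    mirrored[board == 2] = 1
    return mirrored

# The bottom row from the post: player 1 sees [0,0,1,0,2,0,0] ...
row = np.array([0, 0, 1, 0, 2, 0, 0])
print(relative_board(row, 2))  # ... player 2 sees [0 0 2 0 1 0 0]
```

Because both seats see the same canonical encoding, a single policy can play either side, which is what makes self-play against past snapshots possible.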
The only policy that is learning is the player 1 policy.
I’ve tried to implement self-play by using past versions of player 1 as opponents. In my case, I have 5 opponent agents: the first opponent has the latest weights from player 1, the second has the previous weights, and so on. I’ve tried to update the weights in two ways:

  1. Every N timesteps.
  2. Every time player 1 defeats player 2 a certain number of times.
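The two update schemes above can be sketched as a single trigger function, e.g. checked inside an `on_train_result` callback. The constants and names here (`SNAPSHOT_TIMESTEPS`, `WIN_RATE_THRESHOLD`) are illustrative assumptions, not values from the original implementation:

```python
SNAPSHOT_TIMESTEPS = 100_000   # scheme 1: snapshot every N env steps
WIN_RATE_THRESHOLD = 0.55      # scheme 2: snapshot once player 1 dominates

def should_snapshot(timesteps_total: int, last_snapshot_at: int,
                    win_rate: float, scheme: str) -> bool:
    """Decide whether to copy player 1's weights into the opponent pool."""
    if scheme == "timesteps":
        # Scheme 1: a fixed number of environment steps has elapsed.
        return timesteps_total - last_snapshot_at >= SNAPSHOT_TIMESTEPS
    if scheme == "win_rate":
        # Scheme 2: player 1 beats the opponents often enough.
        return win_rate >= WIN_RATE_THRESHOLD
    raise ValueError(f"unknown scheme: {scheme}")
```

Either trigger only decides *when* to snapshot; as discussed later in the thread, the snapshot itself also has to be pushed out to the rollout workers or they keep sampling with stale opponent weights.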

The problem is that my agent does not learn. I’ve used a minimax algorithm with depth 1 to evaluate the model, but after 10M steps it still can’t win even 50 games out of 100.
When I checked the TensorBoard graphs I noticed that player 1, after a short initial period, beats player 2 almost every time (even right after I’ve updated player 2’s weights). Did I miss any important point in this implementation that could cause this problem?

Hi There,

I’m doing a similar thing: a turn-based card game with MultiAgentEnv and self-play with PPO. I’d be interested in comparing your approach to self-play. Mine is based on what is described here: How to Implement Self Play with PPO? [rllib] · Issue #6669 · ray-project/ray · GitHub.

I have positive results vs. previous attempts. After several hundred training iterations my agent is able to beat a simple rule based agent I’ve written 40% of the time, something previous agents haven’t even been close to.

I’m running into the same problem as you with updating weights. I’m training policy 1. Policies 2-4 are supposed to contain old versions of policy 1. I shift the weights each time policy 1 achieves a >55% win rate over the course of one training iteration. But as you can see, the average policy rewards from each of the other three policies are significantly lower than I’d expect; surely they should be roughly equal to policy 1’s average reward. The per-episode win rate is around 80% as well, whereas I’d expect it to be around 50%, since the other 3 policies are meant to be similar in skill to the trained policy. These are the sorts of results I’d expect from a trained agent versus a random-action agent.

Interestingly, if I restart training from a saved checkpoint (which I have to do due to a memory issue: PPO trainer eating up memory), the weights seem to propagate properly and the win rate is around 50-60%.

New users aren’t allowed to post more than 2 links, so here’s my self-play implementation: yaniv-rl/ at e4ac312e3cf05d68d80a3b93e5efdfa967712968 · york-rwsa/yaniv-rl · GitHub

I think I figured it out. The remote workers, which gather all the samples to train from, don’t seem to get their policies synced. I guess that makes sense, since those policies aren’t being trained and nothing tells the workers their weights have been updated. From ray.rllib.optimizers.torch_distributed_data_parallel_optimizer — Ray 0.8.5 documentation:

    # Sync up the weights. In principle we don't need this, but it doesn't
    # add too much overhead and handles the case where the user manually
    # updates the local weights.
    if self.keep_local_weights_in_sync:
        with self.sync_up_timer:
            weights = ray.put(self.workers.local_worker().get_weights())
            for e in self.workers.remote_workers():
                e.set_weights.remote(weights)

I think the more up-to-date way to sync the weights is like so (taken from ray/ at cdbaf930ab4bbe8138eae16f15967c5eb9974385 · ray-project/ray · GitHub):

        weights = ray.put(self.trainer.workers.local_worker().save())
        self.trainer.workers.foreach_worker(
            lambda w: w.restore(ray.get(weights)))

Thanks for the answer.
I’m following the same idea from the link that you sent in the first answer, and my graphs look exactly like yours. The function that I use to update the weights is the following:

def multiagent_self_play(trainer: Type[Trainer]):
    # Shift weights down the chain: player1 -> opponent 1,
    # opponent 1 -> opponent 2, and so on.
    new_weights = trainer.get_policy("player1").get_weights()
    for opp in Config.OPPONENT_POLICIES:
        prev_weights = trainer.get_policy(opp).get_weights()
        trainer.get_policy(opp).set_weights(new_weights)
        new_weights = prev_weights

What you said about the remote workers is interesting; I hadn’t thought about that. The only problem is that I also tried training my network with “num_workers” = 0. With this setting no remote workers should be created, and all the rollouts are done on the local trainer. Unfortunately, the results are the same for me.

Hmm shame. This totally fixed my issues and self-play seems to be working now:

Might be worth giving it a go with multiple workers and syncing the weights. I’m not entirely sure how this framework works, so I can’t give the best advice right now, sorry.

Ok, I’ve tried your solution, but it looks like it doesn’t work for me.

The scores in the graph are reset every time the weights are copied from one policy to another. I expected a behaviour similar to the one in the first 300k steps, but after a while the curve just became flat, and it does not improve during the evaluation phase. Furthermore, with the code that you proposed I’m getting this warning:

    WARNING -- Cannot restore an optimizer's state 
    for tf eager! Keras is not able to save the v1.x optimizers 
    (from tf.compat.v1.train) since they aren't compatible with checkpoints.

since I’m working with TensorFlow 2.4.0.
@sven1977, may I ask if you have some more info about this? That is, whether there is a way to synchronize the updated policy weights with the rollout workers.

Hi all. I am also interested to know whether we should synchronize the weights during self-play and what the best way to do so is. Also, it would be nice if someone could tell us whether this is necessary when using “num_workers”: 0.

Also, I want to ask @Rory if he can help me try his solution.
I am working on a self-play multiagent setting with PPO, and I change the weights in the on_train_result callback. The trainable policy is “shared_policy_1” and the policy into which the self-play weights are copied is “shared_policy_2”.
So I do something like this (here is a simplified version, with “men” being a dictionary of previous policies’ weights):

    class MyCallbacks(DefaultCallbacks):

        def on_train_result(self, *, trainer, result: dict, **kwargs):
            print("trainer.train() result: {} -> {} episodes".format(
                trainer, result["episodes_this_iter"]))
            # Save the trainable policy's weights in the dictionary ...
            men[one_key] = trainer.get_policy("shared_policy_1").get_weights()
            # ... and load a (possibly older) snapshot into the opponent policy.
            trainer.set_weights({"shared_policy_2": men[some_key]})

Should I do the weight syncing in the callback like this?

men[some_key] = ray.put(trainer.workers.local_worker().save())
trainer.workers.foreach_worker(
    lambda w: w.restore(ray.get(men[some_key])))

Or would that overwrite the weights of the trainable policy “shared_policy_1”, which I suppose I don’t have to sync? I am new to RLlib, so I don’t understand the ray.put usage exactly.
Thanks in advance.
Best regards,

Hi George,

I’m not entirely sure, but as far as I could figure out, you still need to sync the weights. I’m pretty new to RLlib too and am just using it atm rather than actually understanding it! From experience, even when running in local mode, not syncing the weights meant that the self-play weights weren’t updated.

I use pretty much the same code as you :) and it seems to work.


Hey everyone @Rory, @george_sk, @ColdFrenzy. Thanks for this cool discussion on self-play with board games :)

  • It’s correct that the weights-to-worker sync only happens at Trainer instantiation time and after a policy is trained via Trainer.train() (which the “behind” policies in a self-play setup are not!). So one would have to implement @Rory’s suggestion of making sure the rollout workers also get the correct weights.


    def set_weights(self, weights: Dict[PolicyID, dict]):
        self.workers.local_worker().set_weights(weights)  # <-- only the local worker's policy's weights are set!

@Rory, these TB plots look really cool. May I ask you to send a PR with a short example script demonstrating your self-play setup? This would be incredibly useful for the community, as it’s an important RL concept that we still don’t have an example for in the repo. Let me know how I can help.

Hi @sven1977, I’d be happy to make an example at some point, once my current project has finished. Can you recommend an environment to use?

Hey @Rory, of course, no rush, and thanks for offering this! I would love to have an OpenSpiel example, but we currently don’t have an adapter for it (even though we may get one soon from another contributor). A simpler option would be to just use your Connect 4 env?

Hi @sven1977. Thanks for the reply. A quick question to clarify the correct self-play scheme.

If “menagerie” is a dictionary of previous policies’ weights, “shared_policy_1” is the trained policy (“policies_to_train”: [“shared_policy_1”]) and “shared_policy_2” is the self-play policy, does the code below correctly sync the weights?
Also, as I understand it, this is necessary even for zero workers. Is that correct?

class MyCallbacks(DefaultCallbacks):

    def on_train_result(self, *, trainer, result: dict, **kwargs):
        print("trainer.train() result: {} -> {} episodes".format(
            trainer, result["episodes_this_iter"]))
        # Save the trained policy's weights in the menagerie ...
        menagerie[one_key] = trainer.get_policy("shared_policy_1").get_weights()
        # ... load a snapshot into the self-play policy ...
        trainer.set_weights({"shared_policy_2": menagerie[some_key]})

        # ... and push the updated local weights out to the rollout workers.
        weights = ray.put(trainer.workers.local_worker().save())
        trainer.workers.foreach_worker(
            lambda w: w.restore(ray.get(weights)))


Hey @george_sk ,
this looks good. Actually, yeah, I’m not sure why Trainer.set_weights() only sets the weights on the local worker and none of the remote ones:

    def set_weights(self, weights: Dict[PolicyID, dict]):
        self.workers.local_worker().set_weights(weights)
So yes, you would have to do this additional for-each-worker step.
Maybe you can also try this (it would sync just the weights, not the entire worker’s pickled state):

local_weights = trainer.workers.local_worker().get_weights()
trainer.workers.foreach_worker(lambda worker: worker.set_weights(local_weights))

Hi @george_sk,

Out of interest, how are you storing and sampling your menagerie? How many historical policies do you store in it, and does the number you store have any impact on performance?

I’m using RLlib’s multiagent config to store a few and then only train one policy.

FWIW, I use the trainer.set_weights function when shifting the weights around, and then push the local weights I’ve just changed to all the workers, as Sven described.



Hi @Rory ,

I am using a dictionary where I store the weights in the on_train_result callback, and I select a value from it to use in set_weights (I have two policies). As for the number, I keep 10 previous sets of weights, but I am still searching for the best performance; I think it also depends on the environment you use, and there is no general answer. I use uniform sampling, but again, this is something you can play with.
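A bounded store plus uniform sampling, as described above, can be sketched with a deque. The `Menagerie` class, `maxlen=10` and the usage lines are illustrative assumptions matching the description, not the actual code:

```python
import random
from collections import deque

class Menagerie:
    """A bounded pool of opponent weight snapshots with uniform sampling."""

    def __init__(self, maxlen: int = 10):
        # With maxlen set, the oldest snapshot is dropped automatically
        # whenever a new one is appended past capacity.
        self.snapshots = deque(maxlen=maxlen)

    def add(self, weights):
        self.snapshots.append(weights)

    def sample(self):
        # Uniform sampling over stored snapshots; other schemes
        # (e.g. favouring recent snapshots) are easy to swap in.
        return random.choice(self.snapshots)

# Hypothetical usage inside on_train_result:
#   menagerie.add(trainer.get_policy("shared_policy_1").get_weights())
#   trainer.set_weights({"shared_policy_2": menagerie.sample()})
#   ... then sync the local weights to the rollout workers as above.
```

The deque’s `maxlen` gives the fixed-size history for free; the trade-off discussed in the thread (how many snapshots to keep) then reduces to a single constructor argument.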