Hi, I’ve implemented a multi-agent version of Connect 4 and I’m trying to train it with PPO through self-play.
At each turn the environment returns the observation and reward for the player that will move next.
The observation consists of two parts:
The board configuration from the current player’s point of view (for example, if player 1 sees the bottom row as [0,0,1,0,2,0,0], player 2 will see it as [0,0,2,0,1,0,0]; I did this so that a single policy can be used for self-play, see the sketch after this list).
An action mask, which I use in my custom model to filter out invalid actions.
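For clarity, the perspective flip is essentially this (a simplified sketch rather than my exact code; the board stores 1 for player 1’s pieces and 2 for player 2’s):

```python
import numpy as np

def observation_for(board: np.ndarray, player: int) -> np.ndarray:
    """Return the board as seen by `player`, so that each agent
    always sees its own pieces as 1 and the opponent's as 2."""
    if player == 1:
        return board.copy()
    flipped = board.copy()
    flipped[board == 1] = 2  # player 1's pieces become the opponent's
    flipped[board == 2] = 1  # player 2's pieces become "mine"
    return flipped
```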
After the final winning move, the environment returns observations and rewards for both players: reward +1 for the winner and -1 for the loser. I’ve also randomized which player moves first (e.g. player 1 can start as the second player to move), so that player 1 sees all possible board configurations.
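So the very last step of an episode looks roughly like this (a simplified sketch in the style of a dict-based multi-agent env; the agent ids are placeholders for my own ones):

```python
# Terminal step after player 1 drops the winning piece (simplified).
obs = {
    "player_1": observation_for(board, 1),
    "player_2": observation_for(board, 2),
}
rewards = {"player_1": 1.0, "player_2": -1.0}
dones = {"player_1": True, "player_2": True, "__all__": True}
```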
The only policy that is being trained is the player 1 policy (see the config sketch below).
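The multi-agent setup is roughly this (a simplified, RLlib-style sketch; `obs_space`, `act_space` and `select_policy` stand in for my actual spaces and mapping function, and the policy ids are placeholders):

```python
config = {
    "multiagent": {
        "policies": {
            # the learning policy
            "main": (None, obs_space, act_space, {}),
            # five frozen opponents holding past snapshots of "main"
            **{f"opponent_{i}": (None, obs_space, act_space, {})
               for i in range(1, 6)},
        },
        "policy_mapping_fn": select_policy,  # agent id -> policy id
        "policies_to_train": ["main"],       # only "main" is optimized
    },
}
```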
I’ve implemented self-play by using past versions of player 1 as opponents. In my case there are 5 opponent policies: the first opponent has the latest player 1 weights, the second has the previous snapshot, and so on (roughly as in the sketch after this list). I’ve tried updating the opponent weights in two ways:
- Every N timesteps.
- Every time player 1 has defeated the opponent a certain number of times.
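The weight update itself is roughly this (a simplified sketch assuming RLlib-style `get_weights`/`set_weights`; the policy ids match the config sketch above):

```python
def update_opponents(algorithm):
    """Shift the snapshot queue: opponent_1 receives the current
    "main" weights, opponent_2 receives opponent_1's old weights, etc."""
    main_weights = algorithm.get_policy("main").get_weights()
    # Go from the oldest opponent to the newest so that no snapshot is
    # overwritten before it has been passed down the queue.
    for i in range(5, 1, -1):
        older = algorithm.get_policy(f"opponent_{i - 1}").get_weights()
        algorithm.get_policy(f"opponent_{i}").set_weights(older)
    algorithm.get_policy("opponent_1").set_weights(main_weights)
    # (With remote rollout workers the new weights also have to be
    # synced out to the workers.)
```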
The problem is that player 1 does not seem to learn. I evaluate the model against a depth-1 minimax agent, but after 10M steps it still cannot win even 50 games out of 100.
When I checked the TensorBoard graphs, I noticed that after a short initial period player 1 beats the opponents almost every time (even right after I update the opponents’ weights). Did I miss any important point in this implementation that could be causing this problem?