Hi there,
I’m doing a similar thing: a turn-based card game with MultiAgentEnv and self-play with PPO. I’d be interested in comparing your approach to self-play. Mine is based on what is described here: How to Implement Self Play with PPO? [rllib] · Issue #6669 · ray-project/ray · GitHub.
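To make the comparison concrete, my setup looks roughly like the sketch below (old `Trainer`-style API; the env name, observation/action spaces, and the seat-to-policy mapping are placeholders for my actual game, not the real code):

```python
from gym.spaces import Box, Discrete
from ray.rllib.agents.ppo import PPOTrainer

# Rough sketch of the multi-agent wiring: only policy_1 is trained,
# policies 2-4 just act with older copies of its weights.
obs_space = Box(low=0.0, high=1.0, shape=(52,))  # placeholder
act_space = Discrete(10)                          # placeholder


def policy_mapping_fn(agent_id):
    # Seat 0 is always the learning agent, seats 1-3 the frozen opponents
    # (adjust to however your env assigns agent IDs).
    return "policy_1" if agent_id == 0 else "policy_{}".format(agent_id + 1)


config = {
    "env": "card_game_env",  # placeholder name of the registered MultiAgentEnv
    "multiagent": {
        "policies": {
            "policy_{}".format(i): (None, obs_space, act_space, {})
            for i in range(1, 5)
        },
        "policy_mapping_fn": policy_mapping_fn,
        # Only policy_1 gets optimized; the others are frozen opponents.
        "policies_to_train": ["policy_1"],
    },
}

trainer = PPOTrainer(config=config)
```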
I’m seeing better results than with previous attempts: after several hundred training iterations my agent beats a simple rule-based agent I’ve written 40% of the time, which earlier agents never came close to.
I’m running into the same problem as you with updating weights. I’m training policy 1, and policies 2-4 are supposed to hold old versions of policy 1; I shift the weights along each time policy 1 achieves a >55% win rate over the course of one training iteration (see the sketch below). But as you can see, the average rewards for the other three policies are significantly lower than I’d expect; surely they should be roughly equal to policy 1’s average reward. The per-episode win rate is around 80% as well, whereas I’d expect it to be around 50%, since the other three policies are meant to be similar in skill to the trained policy. These are the sorts of results I’d expect from a trained agent versus a random-action agent.
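The weight shift itself is roughly the following, done in a `DefaultCallbacks.on_train_result` hook (old `Trainer` API). The `policy_1_win_rate_mean` custom metric is just a stand-in for however the win rate actually gets tracked, and it assumes all four policies share the same network architecture:

```python
from ray.rllib.agents.callbacks import DefaultCallbacks

WIN_RATE_THRESHOLD = 0.55
# Newest-to-oldest chain: policy_1 is being trained, 2-4 hold snapshots.
POLICY_CHAIN = ["policy_1", "policy_2", "policy_3", "policy_4"]


class SelfPlayCallback(DefaultCallbacks):
    """Shift policy snapshots down the chain (policy_1 -> policy_2 ->
    policy_3 -> policy_4) whenever the trained policy's win rate for the
    last training iteration exceeds 55%."""

    def on_train_result(self, *, trainer, result, **kwargs):
        # "policy_1_win_rate" is a hypothetical custom metric; swap in
        # however you actually compute the per-iteration win rate.
        win_rate = result["custom_metrics"].get("policy_1_win_rate_mean", 0.0)
        if win_rate <= WIN_RATE_THRESHOLD:
            return

        # Read every policy's current weights first, then shift them.
        current = {
            pid: trainer.get_policy(pid).get_weights() for pid in POLICY_CHAIN
        }
        new_weights = {}
        for src, dest in zip(POLICY_CHAIN[:-1], POLICY_CHAIN[1:]):
            # Weight dicts are keyed with each policy's own variable names,
            # so re-key src's values onto dest's keys (assumes identical
            # network architectures across the four policies).
            new_weights[dest] = dict(
                zip(current[dest].keys(), current[src].values()))

        # As far as I understand, set_weights() updates the local worker's
        # copies; whether this then reaches the remote rollout workers is
        # exactly where I'm seeing the problem described above.
        trainer.set_weights(new_weights)
```

The callback is hooked up via `"callbacks": SelfPlayCallback` in the trainer config.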
Interestingly, if I restart training from a saved checkpoint (which I have to do anyway because of a memory issue: PPO trainer eating up memory), the weights do seem to propagate properly and the win rate sits around 50-60%.
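For reference, the restart is just the standard save/stop/restore cycle, roughly (reusing the same `config` as in the sketch above):

```python
from ray.rllib.agents.ppo import PPOTrainer

# Checkpoint, tear the old trainer (and its workers) down to release
# memory, then rebuild and pick up from the checkpoint.
checkpoint_path = trainer.save()
trainer.stop()

trainer = PPOTrainer(config=config)
trainer.restore(checkpoint_path)
```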