RLlib multi-agent Connect 4 issues - why does it 'forget' what it learnt?

How severely does this issue affect your experience of using Ray?

  • Low: It annoys or frustrates me for a moment.

Hi all - this is my first time posting to the discussion group, and I hope you can guide me on my learning journey!
I am trying to implement self-play for Connect 4 with PettingZoo, in a similar form to the self-play example provided on GitHub for OpenSpiel. Besides changing the environment my agents play in, I've made a few minor tweaks out of curiosity (some trials use DQN instead of PPO, and some use fcnet versus VisionNet). I've also implemented action masking.
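For context, the environment side of my setup looks roughly like this (a simplified sketch; the custom action-masking model and the DQN/VisionNet variants are left out, and the registered name "connect_four" is just my local label):

```python
from pettingzoo.classic import connect_four_v3
from ray.rllib.env import PettingZooEnv
from ray.tune.registry import register_env

# Wrap PettingZoo's Connect Four as an RLlib multi-agent env.
# connect_four_v3 observations already contain an "action_mask" entry,
# which my custom model uses to mask out illegal actions.
register_env("connect_four", lambda cfg: PettingZooEnv(connect_four_v3.env()))
```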
The core of my experiment works: a main agent plays against a random agent, and when it reaches a 95% win rate, a new opponent is spawned. Instead of picking an opponent uniformly at random from the pool, I use a weighted sampling function that picks recently added opponents with higher probability (sketched below).
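To make that concrete, here is roughly what my callback does. This is a simplified sketch based on the OpenSpiel self-play example, not my exact code; it assumes Ray 2.x's old API stack, PettingZoo agent IDs `player_0`/`player_1`, a win counted as a positive episode reward, and a simple linear recency weighting standing in for my actual weighting function:

```python
import numpy as np
from ray.rllib.algorithms.callbacks import DefaultCallbacks


class SelfPlayCallback(DefaultCallbacks):
    def __init__(self):
        super().__init__()
        # main_v0 is the initial (random) opponent; frozen snapshots start at v1.
        self.current_opponent = 0

    def on_train_result(self, *, algorithm, result, **kwargs):
        # Simplified win-rate bookkeeping: count positive episode rewards
        # of the "main" policy as wins.
        main_rew = result["hist_stats"]["policy_main_reward"]
        win_rate = sum(r > 0 for r in main_rew) / len(main_rew)
        result["win_rate"] = win_rate

        if win_rate > 0.95:
            # Freeze a snapshot of "main" as a new opponent policy.
            self.current_opponent += 1
            new_pol_id = f"main_v{self.current_opponent}"
            n_versions = self.current_opponent + 1  # pool: main_v0 .. main_v{current}

            def policy_mapping_fn(agent_id, episode, worker, **kw):
                # Alternate which seat "main" controls from episode to episode.
                main_seat = episode.episode_id % 2
                if agent_id == f"player_{main_seat}":
                    return "main"
                # Recency-weighted sampling over the opponent pool:
                # newer snapshots get (linearly) higher probability.
                weights = np.arange(1, n_versions + 1, dtype=float)
                version = np.random.choice(n_versions, p=weights / weights.sum())
                return f"main_v{version}"

            main_policy = algorithm.get_policy("main")
            new_policy = algorithm.add_policy(
                policy_id=new_pol_id,
                policy_cls=type(main_policy),
                policy_mapping_fn=policy_mapping_fn,
                policies_to_train=["main"],
            )
            # Copy the current "main" weights into the frozen snapshot.
            new_policy.set_state(main_policy.get_state())
            algorithm.workers.sync_weights()
```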

Originally, I imagined that if I plotted the mean reward for the main policy, I would see it gradually improve until it reaches a 95% win rate, at which point the win rate would start dropping (because of the introduction of a new opponent), then slowly climb back up to 95%, then drop again, cycling like this until the point where it can no longer beat itself. I understand that the function approximation introduced by neural networks would add some noise to this.
But what I am witnessing is rather strange:

  1. The agent never beats the totally random player (named main_v0)
  2. The agent learns to beat some of the other players (look at main_v5, for example, before the 500K step mark), but then all of a sudden starts acting essentially randomly against that same opponent (around timestep 1.25M, with a mean reward of 0). Why does it do that?

My questions are:

  1. Is this expected?
  2. What can I do to make the training more stable?
  3. How can I view the value function of each of these policies? My bot still does some pretty stupid things
  4. Are there any metrics in Ray that allow me to look at the value function loss, for example?
  5. What’s the right mechanism, in your experience, for adding/removing opponent policies mid-training? How do you keep track, in your callbacks class, of which policies to keep and which to remove?
  6. When should I usually stop training? Does it make sense to take a checkpoint every X steps and then pick the most suitable one afterwards (see the sketch after this list)? I don’t know how to determine the required number of steps analytically, for example.
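To make questions 4 and 6 a bit more concrete, this is roughly what I have in mind (assuming Tune's older `tune.run()` API with `checkpoint_freq`; the stopping criterion and the metric path are guesses on my part):

```python
from ray import tune

config = {
    "env": "connect_four",
    "callbacks": SelfPlayCallback,
    # ... "multiagent" policies / policy_mapping_fn and model settings omitted ...
}

tune.run(
    "PPO",
    config=config,
    stop={"timesteps_total": 5_000_000},  # question 6: this number is picked arbitrarily
    checkpoint_freq=10,                   # checkpoint every 10 training iterations
    checkpoint_at_end=True,
)

# Question 4: is the per-policy learner stat, e.g.
# info/learner/main/learner_stats/vf_loss in the result dict (shown in
# TensorBoard under ray/tune/...), the right place to watch PPO's value loss?
```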

Here’s a screenshot from TensorBoard to help explain. Thank you for helping me learn more!
Regards, Eyas