For the past few weeks, I’ve been preparing a self-play configuration for a challenging zero-sum reinforcement learning environment I’m attempting to solve. I wanted a way to track, quantitatively, how well the agent is improving relative to its past versions. I’ve written a small script to do just that, and I’ve put it on GitHub in case it’s useful to others.
The Colab notebook in the repo tries out the Elo evaluator on a simple Tic Tac Toe environment I put together, generating a list of Elo ratings for the bot and its past versions that updates automatically as training and checkpointing progress. Here’s an example output:
```
╭───────────────────┬──────────╮
│ Checkpoint        │ Rating   │
├───────────────────┼──────────┤
│ main              │ 1479     │
│ checkpoint_000000 │ 672      │
│ checkpoint_000001 │ 460      │
│ checkpoint_000002 │ 1434     │
│ checkpoint_000003 │ 1456     │
│ checkpoint_000004 │ 1458     │
│ checkpoint_000005 │ 1443     │
│ checkpoint_000006 │ 1429     │
│ checkpoint_000007 │ 1352     │
│ checkpoint_000008 │ 1419     │
│ checkpoint_000009 │ 1444     │
│ checkpoint_000010 │ 1437     │
│ checkpoint_000011 │ 1421     │
│ checkpoint_000012 │ 1418     │
│ checkpoint_000013 │ 1417     │
│ checkpoint_000014 │ 1416     │
│ checkpoint_000015 │ 1420     │
│ checkpoint_000016 │ 1420     │
│ checkpoint_000017 │ 1431     │
│ checkpoint_000018 │ 1437     │
│ checkpoint_000019 │ 1460     │
│ checkpoint_000020 │ 1474     │
│ checkpoint_000021 │ 1453     │
│ checkpoint_000022 │ 1467     │
│ checkpoint_000023 │ 1480     │
╰───────────────────┴──────────╯
```
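(For reference, ratings like these come from the standard Elo expected-score and update rule. Here’s a minimal sketch of that rule — my own paraphrase, not the repo’s exact code:)

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))


def update_elo(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Update both players' ratings after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k (the K-factor) controls how quickly ratings move.
    """
    e_a = expected_score(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1.0 - score_a) - (1.0 - e_a))
```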
I’ve written it modularly enough that adapting it to TrueSkill, among other things, should be fairly trivial. It also supports probability-weighting opponents along the lines of…
W_i = exp(-(R_A - R_i)^2 / (2 * sigma^2))
…to encourage interesting matches, which just involves passing an optional vector of weights into sample_elo. You can comment that line out to fall back to uniform random matchmaking.
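Concretely, the weights can be computed like this (a minimal sketch; the helper name, the sigma default, and the example ratings are illustrative rather than the repo’s API — in practice you’d pass the resulting weight vector into sample_elo):

```python
import numpy as np


def opponent_weights(r_current: float, ratings: np.ndarray, sigma: float = 200.0) -> np.ndarray:
    """Gaussian weights favoring opponents rated close to the current agent."""
    w = np.exp(-((r_current - ratings) ** 2) / (2.0 * sigma ** 2))
    return w / w.sum()  # normalize into a probability distribution


# Illustrative usage: sample a past checkpoint to play against.
ratings = np.array([672.0, 460.0, 1434.0, 1456.0])  # hypothetical checkpoint ratings
weights = opponent_weights(r_current=1479.0, ratings=ratings)
opponent = np.random.choice(len(ratings), p=weights)
```

A smaller sigma concentrates games on near-peers; a larger sigma flattens the weights toward uniform (random) matchmaking.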
Thoughts? Anything you’d change, or any issues that you notice?