I implemented an evaluation callback that tracks an agent's ELO against its past counterparts. How did I do?

For the past few weeks, I’ve been preparing a self-play configuration for a challenging zero-sum reinforcement learning environment I’m attempting to solve. I’ve wanted a rough, at-a-glance way of tracking how much the agent is improving relative to its past versions. I’ve written a neat script to do just that, and I’ve added it to GitHub in case it’s useful to others.

The Colab notebook in the repo tests out the ELO evaluator on a simple Tic Tac Toe environment I put together, generating a list of ELO ratings for the bot and its past versions that updates automatically as learning and checkpointing progress. Here’s an example output:

╭───────────────────┬──────────╮
│ Checkpoint        │   Rating │
├───────────────────┼──────────┤
│ main              │     1479 │
│ checkpoint_000000 │      672 │
│ checkpoint_000001 │      460 │
│ checkpoint_000002 │     1434 │
│ checkpoint_000003 │     1456 │
│ checkpoint_000004 │     1458 │
│ checkpoint_000005 │     1443 │
│ checkpoint_000006 │     1429 │
│ checkpoint_000007 │     1352 │
│ checkpoint_000008 │     1419 │
│ checkpoint_000009 │     1444 │
│ checkpoint_000010 │     1437 │
│ checkpoint_000011 │     1421 │
│ checkpoint_000012 │     1418 │
│ checkpoint_000013 │     1417 │
│ checkpoint_000014 │     1416 │
│ checkpoint_000015 │     1420 │
│ checkpoint_000016 │     1420 │
│ checkpoint_000017 │     1431 │
│ checkpoint_000018 │     1437 │
│ checkpoint_000019 │     1460 │
│ checkpoint_000020 │     1474 │
│ checkpoint_000021 │     1453 │
│ checkpoint_000022 │     1467 │
│ checkpoint_000023 │     1480 │
╰───────────────────┴──────────╯
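
For anyone unfamiliar, ratings like these are typically produced by the standard Elo expected-score and update rule. Here’s a minimal sketch of that rule; the k-factor and function names are illustrative, not the repo’s actual API:

```python
def expected_score(r_a, r_b):
    """Probability that a player rated r_a beats a player rated r_b under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))


def update_elo(r_a, r_b, score_a, k=32.0):
    """Update both ratings after one game; score_a is 1.0 / 0.5 / 0.0 for a win / draw / loss by A."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1.0 - score_a) - (1.0 - e_a))


# Example: "main" (1479) beats "checkpoint_000008" (1419)
# update_elo(1479, 1419, 1.0)  ->  (approx. 1492, 1406)
```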

I’ve written it modularly enough that adapting it to TrueSkill (or another rating system) should be fairly trivial. It also supports probability-weighting opponent selection along the lines of…

W_i = exp(-(R_A - R_i)^2 / (2 * sigma^2))

…to encourage interesting matches; this just involves passing an optional vector of weights into sample_elo. You can comment that line out to fall back to uniform random matchmaking.
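
To illustrate, here’s a small sketch of how that Gaussian can be turned into a matchmaking distribution. The sigma value and the gaussian_weights name are illustrative; in the repo itself this corresponds to the optional weights argument to sample_elo mentioned above:

```python
import numpy as np


def gaussian_weights(r_agent, opponent_ratings, sigma=200.0):
    """W_i = exp(-(R_A - R_i)^2 / (2 * sigma^2)), normalized into a probability vector."""
    ratings = np.asarray(opponent_ratings, dtype=float)
    w = np.exp(-((r_agent - ratings) ** 2) / (2.0 * sigma ** 2))
    return w / w.sum()


# Bias matchmaking toward checkpoints rated near the current agent:
ratings = {"checkpoint_000000": 672, "checkpoint_000022": 1467, "checkpoint_000023": 1480}
probs = gaussian_weights(1479, list(ratings.values()))
opponent = np.random.choice(list(ratings.keys()), p=probs)
```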

Thoughts? Anything you’d change, or any issues that you notice?

Your approach—tracking ELO ratings for self-play agents and their checkpoints—is well-aligned with best practices in league-based self-play RL, as seen in RLlib’s OpenSpiel and league-based examples. Using ELO or TrueSkill to evaluate progress and opponent selection is a common and effective method for qualitative assessment of agent improvement over time. The weighting scheme you mention (using a Gaussian function of rating difference) is a reasonable way to bias match selection toward similarly skilled opponents, which can accelerate learning and provide more meaningful evaluations (RLlib self-play league-based example, RLlib self-play callback).

No major issues stand out from your description. RLlib’s own league-based self-play callbacks use similar logic for opponent selection, checkpointing, and tracking win rates or ELO-like metrics. If you want to further align with RLlib’s practices, consider integrating your ELO evaluator as a callback or evaluation function within the training loop, and ensure that policy weights are properly synced across workers when new checkpoints are added (RLlib self-play callback, RLlib self-play discussion). Would you like a step-by-step breakdown or code integration example?
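
As a starting point, here is a rough sketch of that integration pattern, modeled on RLlib’s self-play examples. The callback base class and exact method signatures vary across RLlib versions (this assumes the older DefaultCallbacks / on_train_result API), and the "main" / checkpoint policy IDs simply mirror the table above:

```python
from ray.rllib.algorithms.callbacks import DefaultCallbacks


class SelfPlayEloCallback(DefaultCallbacks):
    """Sketch: snapshot 'main' as a frozen opponent each iteration and keep workers in sync."""

    def on_train_result(self, *, algorithm, result, **kwargs):
        # Snapshot the current 'main' policy as a frozen checkpoint opponent.
        # (This is also the natural place to run the Elo evaluator and log its output.)
        main_policy = algorithm.get_policy("main")
        new_id = f"checkpoint_{result['training_iteration']:06d}"
        new_policy = algorithm.add_policy(policy_id=new_id, policy_cls=type(main_policy))
        new_policy.set_state(main_policy.get_state())

        # Push the snapshot's weights out to all rollout workers.
        # (A full setup would also update policy_mapping_fn so episodes can be
        # matched against the new ID.)
        algorithm.workers.sync_weights(policies=[new_id])
```

In practice you would gate the snapshot on a win-rate or rating threshold rather than taking one every iteration, as RLlib’s league-based example does.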

Happy to discuss my approach with anyone who’s interested or working on something similar.