Your approach, tracking Elo ratings for self-play agents and their checkpoints, is well aligned with best practices in league-based self-play RL, as seen in RLlib’s OpenSpiel and league-based examples. Using Elo or TrueSkill both to evaluate progress and to drive opponent selection is a common and effective way to quantify agent improvement over time. The weighting scheme you mention (a Gaussian function of the rating difference) is a reasonable way to bias match selection toward similarly skilled opponents, which can accelerate learning and make evaluations more meaningful (RLlib self-play league-based example, RLlib self-play callback).
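To make the rating and weighting logic concrete, here is a minimal, framework-agnostic sketch. It is not taken from your code or from RLlib: the function names, the K-factor of 32, the `sigma` of 200 rating points, and the small `min_weight` floor are all illustrative assumptions that you would replace with your own values.

```python
import math
import random


def expected_score(rating_a, rating_b):
    """Standard Elo expected score of A against B (base 10, scale 400)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b).

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k=32 is an assumed K-factor, not a recommendation.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def opponent_weights(agent_rating, opponent_ratings, sigma=200.0, min_weight=1e-6):
    """Gaussian weight per opponent, peaked at a rating difference of zero.

    min_weight (an assumption) keeps very distant opponents selectable and
    avoids an all-zero weight vector when every opponent is far away.
    """
    return {
        name: max(
            math.exp(-((rating - agent_rating) ** 2) / (2.0 * sigma ** 2)),
            min_weight,
        )
        for name, rating in opponent_ratings.items()
    }


def sample_opponent(agent_rating, opponent_ratings, sigma=200.0, rng=random):
    """Draw one opponent name with probability proportional to its weight."""
    weights = opponent_weights(agent_rating, opponent_ratings, sigma)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
```

A smaller `sigma` concentrates matches on near-equal opponents, while a larger one keeps older, weaker checkpoints in rotation, which can help guard against forgetting how to beat them.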
No major issues stand out from your description. RLlib’s own league-based self-play callbacks use similar logic for opponent selection, checkpointing, and tracking win rates or Elo-like metrics. To align further with RLlib’s practices, consider running your Elo evaluator as a callback or evaluation function inside the training loop, and make sure policy weights are synced across workers whenever a new checkpoint is added (RLlib self-play callback, RLlib self-play discussion). Would you like a step-by-step breakdown or code integration example?
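As a starting point, the checkpoint bookkeeping could be sketched as below. This is plain Python and deliberately does not use any RLlib API: the `League` class, its method names, the 0.7 win-rate threshold, and the initial rating of 1200 are hypothetical. The comment marks where a callback would register the new policy and push its weights to the workers.

```python
import copy


class League:
    """Tracks frozen snapshots of the main policy and their Elo ratings."""

    def __init__(self, initial_rating=1200.0, win_rate_threshold=0.7):
        self.win_rate_threshold = win_rate_threshold  # assumed value
        self.main_rating = initial_rating
        self.snapshots = {}  # name -> {"weights": ..., "rating": float}

    def maybe_snapshot(self, main_weights, win_rate, iteration):
        """Freeze a copy of the main policy once its win rate is high enough.

        Returns the new snapshot name, or None if no snapshot was taken.
        """
        if win_rate < self.win_rate_threshold:
            return None
        name = f"snapshot_{iteration}"
        self.snapshots[name] = {
            # Deep copy so later training updates do not alter the snapshot.
            "weights": copy.deepcopy(main_weights),
            # The snapshot inherits the main policy's current rating.
            "rating": self.main_rating,
        }
        # In a real training loop, this is the point to add the snapshot as
        # a policy and sync its weights to every worker.
        return name
```

Seeding the snapshot with the main policy's current rating avoids a burn-in period in which a strong new checkpoint is treated as average.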