I implemented an evaluation callback that tracks an agent's ELO against its past counterparts. How did I do?

RunLLM · March 6, 2026, 9:33pm

Your approach of tracking ELO ratings for self-play agents and their checkpoints is well-aligned with best practices in league-based self-play RL, as seen in RLlib’s OpenSpiel and league-based examples. Using ELO or TrueSkill to measure progress and guide opponent selection is a common and effective way to quantify agent improvement over time. The weighting scheme you mention (using a Gaussian function of rating difference) is a reasonable way to bias match selection toward similarly skilled opponents, which can accelerate learning and provide more meaningful evaluations (RLlib self-play league-based example, RLlib self-play callback).
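For reference, here is a minimal sketch of the two pieces discussed above: the standard ELO update and a Gaussian weighting over rating differences. This is not your implementation or RLlib code. The function names, the K-factor of 32, the 400-point scale, and `sigma=100` are all assumptions chosen for illustration.

```python
import math
import random


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard ELO expected score of player A against player B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k=32 is an assumed K-factor, not a value taken from the original post.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def gaussian_weights(current: float, pool: list, sigma: float = 100.0) -> list:
    """Unnormalized weights that peak when an opponent's rating equals `current`."""
    return [math.exp(-((r - current) ** 2) / (2.0 * sigma ** 2)) for r in pool]


def sample_opponent(current: float, pool: list, sigma: float = 100.0, rng=random) -> int:
    """Sample an index into `pool`, favoring similarly rated opponents.

    A tiny floor keeps the total weight positive even when every opponent
    is so far away that its Gaussian weight underflows to zero.
    """
    weights = [max(w, 1e-12) for w in gaussian_weights(current, pool, sigma)]
    return rng.choices(range(len(pool)), weights=weights, k=1)[0]
```

A smaller `sigma` concentrates matches on near-equal opponents, while a larger one keeps older, weaker checkpoints in rotation, which may help guard against forgetting earlier strategies.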

No major issues stand out from your description. RLlib’s own league-based self-play callbacks use similar logic for opponent selection, checkpointing, and tracking win rates or ELO-like metrics. If you want to further align with RLlib’s practices, consider integrating your ELO evaluator as a callback or evaluation function within the training loop, and ensure that policy weights are properly synced across workers when new checkpoints are added (RLlib self-play callback, RLlib self-play discussion). Would you like a step-by-step breakdown or code integration example?
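As a rough, framework-agnostic illustration of the checkpointing and win-rate bookkeeping mentioned above, the sketch below keeps a rolling win rate and freezes a new checkpoint entry once that rate crosses a threshold. The class `CheckpointLeague`, its method names, and the default threshold and window are hypothetical and do not mirror RLlib's API. Saving the weights and syncing them to workers is left to the caller.

```python
from collections import deque


class CheckpointLeague:
    """Hypothetical bookkeeping for a pool of frozen self-play checkpoints."""

    def __init__(self, win_rate_threshold: float = 0.6, window: int = 50,
                 initial_rating: float = 1200.0):
        self.win_rate_threshold = win_rate_threshold
        self.results = deque(maxlen=window)  # rolling window of 1.0 / 0.0 outcomes
        self.current_rating = initial_rating  # rating of the learning agent
        self.checkpoints = {}  # checkpoint name -> rating frozen at snapshot time

    def record_result(self, won: bool) -> None:
        """Store the outcome of one evaluation game for the learning agent."""
        self.results.append(1.0 if won else 0.0)

    def win_rate(self) -> float:
        """Win rate over the rolling window, or 0.0 when no games are recorded."""
        return sum(self.results) / len(self.results) if self.results else 0.0

    def maybe_snapshot(self, iteration: int):
        """Register a checkpoint if the window is full and the threshold is met.

        Returns the new checkpoint name, or None when no snapshot was taken.
        The caller is expected to save the policy weights under that name and
        push them to any rollout workers.
        """
        window_full = len(self.results) == self.results.maxlen
        if not window_full or self.win_rate() < self.win_rate_threshold:
            return None
        name = f"checkpoint_{iteration}"
        self.checkpoints[name] = self.current_rating
        self.results.clear()  # start a fresh window against the enlarged pool
        return name
```

In a training loop, `record_result` would be called after each evaluation game and `maybe_snapshot` once per iteration. Clearing the window after a snapshot is a design choice that prevents several checkpoints from being added back to back on the same set of results.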
