Question regarding example self-play implementation

Hi, I am trying to implement self-play. I have some questions about this script, particularly this chunk of code: ray/ at master · ray-project/ray · GitHub

main_rew = result["hist_stats"].pop("policy_main_reward")
opponent_rew = list(result["hist_stats"].values())[0]
assert len(main_rew) == len(opponent_rew)
won = 0
for r_main, r_opponent in zip(main_rew, opponent_rew):
    if r_main > r_opponent:
        won += 1
win_rate = won / len(main_rew)
result["win_rate"] = win_rate
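For context, here is a minimal runnable sketch of the same win-rate computation against a mocked `result` dict. The reward values and the `policy_opponent_reward` key are my own dummy data for illustration, not taken from the example script; in RLlib, `result["hist_stats"]` holds lists of recent per-episode rewards keyed as `policy_<id>_reward`.

```python
# Mocked training result: each entry in a "policy_<id>_reward" list is
# that policy's total reward for one recent episode (dummy values).
result = {
    "hist_stats": {
        "policy_main_reward": [1.0, -1.0, 1.0, 1.0],
        "policy_opponent_reward": [-1.0, 1.0, -1.0, -1.0],
    }
}

# Pop main's rewards so the only remaining entry belongs to the opponent,
# which is why list(...)[0] picks out the opposing policy's list.
main_rew = result["hist_stats"].pop("policy_main_reward")
opponent_rew = list(result["hist_stats"].values())[0]
assert len(main_rew) == len(opponent_rew)

# Count episodes in which main out-scored the opponent.
won = sum(1 for r_main, r_opp in zip(main_rew, opponent_rew) if r_main > r_opp)
win_rate = won / len(main_rew)
result["win_rate"] = win_rate
print(win_rate)  # 0.75 for the dummy values above
```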
  1. The first issue I am facing with my own RL script is that len(main_rew) and len(opponent_rew) are not equal. Is there a reason for this, given that the initial policies are identical? Also, what do the individual numbers in the main_rew list represent? I noticed that each one adds up to the reward of a policy over one episode.

  2. Another question is what does hist_stats stand for?

  3. As training continues and the number of episodes increases, the list of numbers in main_rew also grows. Is that why this method pops the list out?

  4. Does opponent_rew always refer to the “opposing policy”? How does it achieve that?

I apologise for the lengthy post. Thank you in advance for the help.