Question regarding example self-play implementation

Hi, I am trying to implement self-play. I have some questions about this script, particularly the following chunk of code from self_play_with_open_spiel.py (ray-project/ray on GitHub, master branch):

main_rew = result["hist_stats"].pop("policy_main_reward")
opponent_rew = list(result["hist_stats"].values())[0]
assert len(main_rew) == len(opponent_rew)
won = 0
for r_main, r_opponent in zip(main_rew, opponent_rew):
    if r_main > r_opponent:
        won += 1
win_rate = won / len(main_rew)
result["win_rate"] = win_rate

  1. The first issue I am facing with my own RL script is that len(main_rew) and len(opponent_rew) are not equal. Is there a reason for this, given that the initial policies are identical? Also, what do the individual numbers in the main_rew list represent? From what I can see, each entry is the summed reward of a policy over one episode.

  2. Another question: what does hist_stats stand for?

  3. As training continues and the number of episodes increases, the list of numbers in main_rew also grows. Is that why this method pops the list?

  4. Does opponent_rew always refer to the “opposing policy”? How does it achieve that?

I apologise for the lengthy post. Thank you for the help in advance.

Hi @Jay ,

  1. hist_stats are simply lists of values that can be used to plot histograms. For a while now, these have been the total episodic rewards and steps, plus the reward per policy. If all policies take the same number of steps, the main rewards and opponent rewards should have the same length.
  2. Although it is not documented, I believe it stands for histogram stats.
  3. I believe this is unnecessary. The “main” win rate is inserted into the results further down, and since that measures how well main is doing against the opponent, it is really the only value you need here. Not 100% sure why it was implemented this way, to be honest.
  4. Yes, opponent_rew always refers to the opposing policy. I’m not really sure what you mean by “How does it achieve that?”
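To make points 3 and 4 a bit more concrete, here is a minimal sketch of the same win-rate calculation that reads the lists without pop() and looks the opponent up by key instead of taking the first remaining value. It is not the code from the example script. The helper name compute_win_rate, the opponent_key parameter, the key name "policy_main_v1_reward", and the assumption that per-policy rewards live under keys shaped like "policy_<id>_reward" are all mine (only "policy_main_reward" appears in the snippet above), and the numbers are made up.

```python
from typing import Dict, List, Optional


def compute_win_rate(
    hist_stats: Dict[str, List[float]],
    main_key: str = "policy_main_reward",
    opponent_key: Optional[str] = None,
) -> float:
    """Return the fraction of episodes in which main out-scored the opponent."""
    # Plain lookup instead of pop(), so main's rewards stay in the results.
    main_rew = hist_stats[main_key]

    if opponent_key is None:
        # Assumption: per-policy reward lists use keys of the form
        # "policy_<id>_reward"; pick the single one that is not main.
        candidates = [
            key for key in hist_stats
            if key.startswith("policy_")
            and key.endswith("_reward")
            and key != main_key
        ]
        if len(candidates) != 1:
            raise ValueError(f"Expected exactly one opponent key, got {candidates}")
        opponent_key = candidates[0]
    opponent_rew = hist_stats[opponent_key]

    # Report a length mismatch explicitly instead of relying on a bare assert.
    if len(main_rew) != len(opponent_rew):
        raise ValueError(
            f"{main_key} has {len(main_rew)} entries, "
            f"{opponent_key} has {len(opponent_rew)}"
        )
    if not main_rew:
        return 0.0

    won = sum(r_main > r_opp for r_main, r_opp in zip(main_rew, opponent_rew))
    return won / len(main_rew)


# Made-up numbers for illustration only.
example = {
    "episode_reward": [0.0, 0.0, 0.0, 0.0],
    "policy_main_reward": [1.0, -1.0, 1.0, 0.0],
    "policy_main_v1_reward": [-1.0, 1.0, -1.0, 0.0],
}
print(compute_win_rate(example))
```

If more than one opponent policy shows up in the stats, this sketch raises an error rather than guessing, so you would have to pass opponent_key yourself. The error message on a length mismatch should also help with tracking down the issue from question 1.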