Hi, I am trying to implement self-play, and I have some queries about the example script ray/self_play_with_open_spiel.py (at master, in the ray-project/ray repo on GitHub), particularly this chunk of code:
main_rew = result["hist_stats"].pop("policy_main_reward")
opponent_rew = list(result["hist_stats"].values())[0]
assert len(main_rew) == len(opponent_rew)
won = 0
for r_main, r_opponent in zip(main_rew, opponent_rew):
    if r_main > r_opponent:
        won += 1
win_rate = won / len(main_rew)
result["win_rate"] = win_rate
-
The first issue I am facing with my own RL script is that main_rew and opponent_rew do not come back with the same length, i.e. len(main_rew) != len(opponent_rew), so the assert fails. Is there a reason for this, given that the initial policies are identical? Also, what do the individual numbers in the main_rew list represent? I noticed that each entry seems to add up to the total reward a policy collects in one episode.
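To illustrate the mismatch, this is roughly the debug snippet I added inside the callback (nothing fancy, just dumping what hist_stats contains):

# Quick debug print inside on_train_result to see where the lengths diverge.
for key, values in result["hist_stats"].items():
    print(f"{key}: {len(values)} entries")
# In my runs the policy_*_reward lists come back with different lengths,
# which is what trips the assert above.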
-
Another question: what does hist_stats stand for?
-
As training continues and the number of episodes increases, the list of numbers in main_rew also grows. Is that why this code pops the list out of hist_stats?
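Or is the pop mainly about removing the key before the opponent lookup on the next line? Just to check my own understanding of the mechanics with a toy dict (not taken from the example):

# Toy check: .pop() returns the list *and* removes the key from the dict.
hist_stats = {
    "policy_main_reward": [1.0, -1.0, 1.0],
    "policy_main_v0_reward": [-1.0, 1.0, -1.0],
}
main_rew = hist_stats.pop("policy_main_reward")
print(main_rew)                 # [1.0, -1.0, 1.0]
print(list(hist_stats.keys()))  # ['policy_main_v0_reward'] -- main's list is gone,
                                # so values()[0] can no longer pick it up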
-
Does opponent_rew always refer to the "opposing policy"? How does the code achieve that, given that it just takes the first value left in hist_stats?
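In my own script I have been experimenting with collecting the opponent lists explicitly by key instead of relying on dict order, along the lines of the sketch below (the "main" naming convention and the helper itself are mine, not from the example):

def opponent_reward_lists(result: dict, main_policy_id: str = "main") -> dict:
    # Gather per-policy reward lists for every policy except the main one,
    # assuming hist_stats keys follow the "policy_<id>_reward" pattern.
    main_key = f"policy_{main_policy_id}_reward"
    return {
        key: values
        for key, values in result["hist_stats"].items()
        if key.startswith("policy_") and key.endswith("_reward") and key != main_key
    }

# e.g. inside on_train_result: opponent_rews = opponent_reward_lists(result)

But I am not sure whether that is equivalent to what the example intends, hence the question.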
I apologise for the lengthy post. Thank you for the help in advance.