Hi, I am trying to implement self-play. I have some questions about this script, particularly the chunk of code below, from `self_play_with_open_spiel.py` in the ray-project/ray GitHub repository:
```python
main_rew = result["hist_stats"].pop("policy_main_reward")
opponent_rew = list(result["hist_stats"].values())
assert len(main_rew) == len(opponent_rew)
won = 0
for r_main, r_opponent in zip(main_rew, opponent_rew):
    if r_main > r_opponent:
        won += 1
win_rate = won / len(main_rew)
result["win_rate"] = win_rate
```
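To show my current understanding of this chunk, here is a toy version of the win-rate computation with made-up reward lists (in the real script these come from `result["hist_stats"]`, which I assume holds one entry per completed episode):

```python
# Hypothetical per-episode rewards; the values are invented for illustration.
main_rew = [1.0, -1.0, 1.0, 1.0]        # rewards credited to the "main" policy
opponent_rew = [-1.0, 1.0, -1.0, -1.0]  # rewards credited to the opponent

assert len(main_rew) == len(opponent_rew)

# Count episodes where "main" outscored the opponent, then normalize.
won = sum(1 for r_main, r_opp in zip(main_rew, opponent_rew) if r_main > r_opp)
win_rate = won / len(main_rew)
print(win_rate)  # 3 wins out of 4 episodes -> 0.75
```

Is this the intended reading, i.e. an episode-by-episode comparison of total rewards?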
The first issue I am facing with my own RL script is that `len(main_rew)` and `len(opponent_rew)` are not equal. Is there a reason for this, given that the initial policies are identical? Also, what do the individual numbers in the `main_rew` list represent? I noticed that each one appears to be the total reward a policy collected in an episode.
Another question: what does `hist_stats` stand for?
As training continues and the number of episodes grows, the list of numbers in `main_rew` also grows. Is that why this method pops the list out of the results?
Does `opponent_rew` always refer to the "opposing policy"? How does it achieve that?
I apologise for the lengthy post. Thanks in advance for your help.