Question regarding example self-play implementation

Hi, I am trying to implement self-play. I have some queries regarding this script, particularly this chunk of code. ray/ at master · ray-project/ray · GitHub

main_rew = result["hist_stats"].pop("policy_main_reward")
opponent_rew = list(result["hist_stats"].values())[0]
assert len(main_rew) == len(opponent_rew)
won = 0
for r_main, r_opponent in zip(main_rew, opponent_rew):
    if r_main > r_opponent:
        won += 1
win_rate = won / len(main_rew)
result["win_rate"] = win_rate
  1. The first issue I am facing with my own RL script is that len(main_rew) and len(opponent_rew) are not the same. Is there a reason for this, given that the initial policies are identical? Also, what do the individual numbers in the main_rew list represent? As far as I can tell, each one is the total reward of a policy over one episode.

  2. Another question is what does hist_stats stand for?

  3. As training continues and the number of episodes increases, the list of numbers in main_rew also grows. Is that why this method pops the list?

  4. Does opponent_rew always refer to the “opposing policy”? How does it achieve that?

I apologise for the lengthy post. Thank you for the help in advance.

Hi @Jay ,

  1. hist_stats are simply lists of values that can be used to plot histograms. For a while now, these have been the total episodic rewards and steps, plus the reward per policy. If all policies take the same number of steps, main rewards and opponent rewards should have the same length.
  2. Although it is not documented, I believe it stands for histogram stats.
  3. I believe this is unnecessary. The “main” win rate is inserted into the results further down, and since that is a measurement of how well main is doing vs opponent, that’s kind of the only value you need here. Not 100% sure why it was implemented this way tbh.
  4. Yes, opponent_rew always refers to the opposing policy. I’m not really sure what you mean by “How does it achieve that?”
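To track down a length mismatch like the one in question 1, it can help to count how many entries each hist_stats list holds. Below is a minimal sketch on a hand-made dictionary. The key names follow the pattern discussed in this thread (episode_reward, episode_lengths, policy_<id>_reward), but the values, the opponent names random and main_v1, and the helper summarize_hist_stats are all made up for illustration and are not part of Ray.

```python
def summarize_hist_stats(hist_stats):
    """Return {key: number of entries} for every list in hist_stats."""
    return {key: len(values) for key, values in hist_stats.items()}


# Made-up data: main played three episodes, two against "random"
# and one against "main_v1", so no single opponent list matches main's length.
hist_stats = {
    "episode_reward": [0.0, 0.0, 0.0],
    "episode_lengths": [12, 9, 15],
    "policy_main_reward": [1.0, -1.0, 1.0],
    "policy_random_reward": [-1.0, 1.0],
    "policy_main_v1_reward": [-1.0],
}

lengths = summarize_hist_stats(hist_stats)
for key, n in lengths.items():
    print(f"{key}: {n} entries")

# Per-policy reward lists whose length differs from the main policy's list.
mismatched = [
    key
    for key, n in lengths.items()
    if key.startswith("policy_") and n != lengths["policy_main_reward"]
]
print("mismatched:", mismatched)
```

If main faces more than one opponent during an iteration, each opponent's list only covers the episodes that opponent actually played, which is one plausible way to end up with different lengths.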

Hi Jay - I was trying to work these problems out as well. I am new to Ray, and when I tried implementing a self-play algorithm (following the same GitHub example), I also got different episode lengths for different opponents… Here’s my analysis:

  1. The self-play example is really poorly written and makes some assumptions. The result["hist_stats"] dictionary includes, at the very beginning, a list of the episode rewards. Because Connect 4 is a zero-sum game (if I win you lose, and if I lose you win), the episode reward is always 0 for each episode. The developers are not really comparing main_rew against the opponent's rewards; instead they are comparing main_rew against the episode rewards (the list that is full of zeros). The logic still holds: if main gets +1, it’s possible to infer that the opponent got -1… The calculation is totally unnecessary though, and you could substitute their loop with:
    for r_main in main_rew:
        if r_main > 0:
            won += 1

Popping the item out of the dictionary is totally unnecessary IMO; the list could simply have been read in place. I still don’t know whether the result["hist_stats"] object is returned to the trainer after the callback or not…
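Here is a rough sketch of that idea on plain Python data, so it runs without Ray. It assumes a zero-sum game where a positive main reward means a win, and it assumes episode_reward is the first key in hist_stats, as described above. The function win_rate_from_main, the policy_random_reward key, and the sample values are hypothetical.

```python
def win_rate_from_main(hist_stats, main_key="policy_main_reward"):
    """Fraction of episodes in which the main policy's reward was positive.

    Assumes a zero-sum game, so a positive main reward implies a win.
    Reads the list in place instead of popping it out of hist_stats.
    """
    main_rew = hist_stats.get(main_key, [])
    if not main_rew:
        return 0.0  # no finished episodes yet; avoid dividing by zero
    won = sum(1 for r in main_rew if r > 0)
    return won / len(main_rew)


# Made-up data for four episodes of a zero-sum game.
hist_stats = {
    "episode_reward": [0.0, 0.0, 0.0, 0.0],
    "policy_main_reward": [1.0, -1.0, 1.0, 1.0],
    "policy_random_reward": [-1.0, 1.0, -1.0, -1.0],
}

# With the main entry taken out, the first remaining list is whatever key
# was inserted first; with this ordering that is episode_reward, not an opponent.
remaining = {k: v for k, v in hist_stats.items() if k != "policy_main_reward"}
first_key = next(iter(remaining))
print("first remaining key:", first_key)

print("win rate:", win_rate_from_main(hist_stats))
```

Because nothing is popped, policy_main_reward stays in hist_stats afterwards, and the empty-list guard avoids a ZeroDivisionError on iterations where no episode has finished.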

I hope this helps!
Cheers, Eyas

Hi @arturn,
I cannot find the “main” win rate that you reference in the results.