RLlib league-based self-play example stops learning after the first generation

I want to do league-based self-play for a custom environment, and I thought I would start from the open_spiel Connect Four examples. I ran into some issues with the self_play_league_based_with_open_spiel.py script.

My issue is that after the ‘main’ policy beats the random policy for the first time, it doesn’t seem to learn any further at all.

See the picture below of the ‘main’ win rate with the win-rate-threshold set to 0.9. After 15 iterations it hits the threshold and makes a copy of itself (and instantiates the rest of the league), but it never continues to learn from there.

This is in contrast to the multiple generations of learning I get with the self_play_with_open_spiel.py script, as shown in the second picture (there we reach the win-rate-threshold three times in about 50 iterations, each time making a copy of the policy and then continuing to outperform that copy as it keeps learning). I used the same hyperparameters as above, though with a smaller training batch size (which is why this run takes a bit more than 20 iterations for the first generation, as opposed to 15 above).

I did make two changes to the self_play_league_based_with_open_spiel.py example where I thought there were bugs:

(1) In __init__() I changed the initial set of trainable policies to include only ‘main’, instead of also including ‘main_exploiter_1’ and ‘league_exploiter_1’, since those are supposed to be instantiated only after the first time main beats the random policy:

    def __init__(self):
        super().__init__()
        # All policies in the league.
        self.main_policies = {"main", "main_0"}
        self.main_exploiters = {"main_exploiter_0", "main_exploiter_1"}
        self.league_exploiters = {"league_exploiter_0", "league_exploiter_1"}
        # Set of currently trainable policies in the league.
        self.trainable_policies = {
            "main",
            # "main_exploiter_1",  # should only be "main" to start
            # "league_exploiter_1",
        }

Also changed line 163:

    # if is_main and len(self.trainable_policies) == 3:
    if is_main and len(self.trainable_policies) == 1:  # should be only 1 trainable policy to start, right?

(2) When ‘main’ beats the win-rate threshold, the line that generates the new policy id produced ‘main’ again instead of ‘main_X’ as intended.

Old code, lines 173-176:

                    if policy_id in self.main_policies:
                        # This requires a policy id of the format "main_X" and
                        # won't work on just "main".
                        new_pol_id = re.sub(
                            "_\\d+$", f"_{len(self.main_policies) - 1}", policy_id
                        )
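
A quick check shows why that substitution never produces a new snapshot id from the plain ‘main’ id (the pattern only matches ids that already end in an underscore plus digits; with the initial league, the replacement string is "_1"):

    import re

    re.sub("_\\d+$", "_1", "main")    # -> "main"   (no match, id returned unchanged)
    re.sub("_\\d+$", "_1", "main_0")  # -> "main_1" (works as intended)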

I replaced it with:

                    if policy_id in self.main_policies:
                        # Only "main" should be trainable; all other main_X are frozen.
                        new_pol_id = policy_id + f"_{len(self.main_policies) - 1}"
                        self.main_policies.add(new_pol_id)
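
Both of these changes live in the league self-play callback class (the one whose __init__ is shown under (1)). For context, that callback is hooked into the training config in the usual RLlib way; a minimal sketch assuming PPO (the class and environment names here are illustrative, not copied verbatim from the script):

    from ray.rllib.algorithms.callbacks import DefaultCallbacks
    from ray.rllib.algorithms.ppo import PPOConfig


    class LeagueSelfPlayCallback(DefaultCallbacks):
        """League bookkeeping; snapshots and exploiters are added in on_train_result()."""
        # __init__ as shown under (1), with only "main" trainable at the start.


    config = (
        PPOConfig()
        .environment("open_spiel_env")      # assumed to be registered elsewhere
        .callbacks(LeagueSelfPlayCallback)  # hook the league logic into training
    )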

Otherwise I’m using the scripts as provided in Ray 2.9.2 (though I also added a win-rate field to the results in the league script so I could compare the two runs).
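
Concretely, the win rate is computed roughly the way the non-league example does it, from the per-episode policy rewards in hist_stats, at the top of on_train_result. Sketched here as a standalone callback so it runs on its own; the hist_stats key names are what I recall the non-league example using and may need adjusting:

    from ray.rllib.algorithms.callbacks import DefaultCallbacks


    class WinRateLoggingCallback(DefaultCallbacks):
        """Adds a 'win_rate' entry to the training results (sketch only)."""

        def on_train_result(self, *, algorithm, result, **kwargs):
            # Per-episode rewards of "main" vs. its current opponent. The
            # "policy_<id>_reward" key convention is taken from the non-league
            # example and is an assumption for the league script.
            main_rew = result["hist_stats"].pop("policy_main_reward")
            opponent_rew = list(result["hist_stats"].values())[0]
            won = sum(r_m > r_o for r_m, r_o in zip(main_rew, opponent_rew))
            result["win_rate"] = won / len(main_rew) if main_rew else 0.0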

Am I missing something here? Has anyone else successfully used the league training script with Connect Four or another environment?