RLlib league-based self-play example stops learning after the first generation

I want to do league-based self-play for a custom environment, and I thought I would start from the open_spiel Connect Four examples. I ran into some issues with the self_play_league_based_with_open_spiel.py script.

My issue is that after the ‘main’ policy beats the random policy for the first time, it doesn’t seem to learn any further at all.

See the picture below of the ‘main’ win rate with the win-rate-threshold set to 0.9. After 15 iterations it hits the threshold and makes a copy of itself (and instantiates the rest of the league), but it never continues to learn from there.

This is in contrast to the multiple generations of learning I get with the self_play_with_open_spiel.py script, as shown in the second picture (there we reach the win-rate-threshold three times in about 50 iterations, each time making a copy of the policy and then continuing to outperform that copy as it keeps learning). I used the same hyperparameters as above, though with a smaller training batch size (which is why this run takes a bit more than 20 iterations for the first generation, as opposed to 15 above).

I did make two changes to the self_play_league_based_with_open_spiel.py example where I thought there were bugs:

(1) In __init__() I changed the initial set of trainable policies to include only ‘main’, instead of also including ‘main_exploiter_1’ and ‘league_exploiter_1’, since those are supposed to be instantiated only after the first time main beats the random policy:

    def __init__(self):
        super().__init__()
        # All policies in the league.
        self.main_policies = {"main", "main_0"}
        self.main_exploiters = {"main_exploiter_0", "main_exploiter_1"}
        self.league_exploiters = {"league_exploiter_0", "league_exploiter_1"}
        # Set of currently trainable policies in the league.
        self.trainable_policies = {
            "main",
            # "main_exploiter_1",  # should only be "main" to start
            # "league_exploiter_1",
        }

Also changed line 163:

    # if is_main and len(self.trainable_policies) == 3:
    if is_main and len(self.trainable_policies) == 1:  # should be only 1 trainable policy to start, right?

(2) When ‘main’ beats the win-rate threshold, the line that generates the new policy id produced ‘main’ again instead of ‘main_X’ as intended.

Old code, lines 173-176:

                    if policy_id in self.main_policies:
                        # This requires a policy id of the format "main_X" and
                        # won't work on just "main".
                        new_pol_id = re.sub(
                            "_\\d+$", f"_{len(self.main_policies) - 1}", policy_id
                        )
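
A quick check shows why that substitution never produces a new snapshot id from the plain ‘main’ id (the pattern only matches ids that already end in an underscore plus digits; with the initial league, the replacement string is "_1"):

    import re

    re.sub("_\\d+$", "_1", "main")    # -> "main"   (no match, id returned unchanged)
    re.sub("_\\d+$", "_1", "main_0")  # -> "main_1" (works as intended)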

I replaced it with:

                    if policy_id in self.main_policies:
                        # Only "main" should be trainable; all other main_X are frozen.
                        new_pol_id = policy_id + f"_{len(self.main_policies) - 1}"
                        self.main_policies.add(new_pol_id)
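
Both of these changes live in the league self-play callback class (the one whose __init__ is shown under (1)). For context, that callback is hooked into the training config in the usual RLlib way; a minimal sketch assuming PPO (the class and environment names here are illustrative, not copied verbatim from the script):

    from ray.rllib.algorithms.callbacks import DefaultCallbacks
    from ray.rllib.algorithms.ppo import PPOConfig


    class LeagueSelfPlayCallback(DefaultCallbacks):
        """League bookkeeping; snapshots and exploiters are added in on_train_result()."""
        # __init__ as shown under (1), with only "main" trainable at the start.


    config = (
        PPOConfig()
        .environment("open_spiel_env")      # assumed to be registered elsewhere
        .callbacks(LeagueSelfPlayCallback)  # hook the league logic into training
    )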

Otherwise I’m using the scripts as provided in Ray 2.9.2 (though I also added a win-rate field to the results in the league script so I could compare the two runs).
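
Concretely, the win rate is computed roughly the way the non-league example does it, from the per-episode policy rewards in hist_stats, at the top of on_train_result. Sketched here as a standalone callback so it runs on its own; the hist_stats key names are what I recall the non-league example using and may need adjusting:

    from ray.rllib.algorithms.callbacks import DefaultCallbacks


    class WinRateLoggingCallback(DefaultCallbacks):
        """Adds a 'win_rate' entry to the training results (sketch only)."""

        def on_train_result(self, *, algorithm, result, **kwargs):
            # Per-episode rewards of "main" vs. its current opponent. The
            # "policy_<id>_reward" key convention is taken from the non-league
            # example and is an assumption for the league script.
            main_rew = result["hist_stats"].pop("policy_main_reward")
            opponent_rew = list(result["hist_stats"].values())[0]
            won = sum(r_m > r_o for r_m, r_o in zip(main_rew, opponent_rew))
            result["win_rate"] = won / len(main_rew) if main_rew else 0.0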

Am I missing something here? Has anyone else successfully used the league training script with Connect Four or another environment?