1. Severity of the issue: High: Completely blocks me.
2. Environment:
- Ray version: 2.49.1
- Python version: 3.12.11
- OS: Ubuntu
Hello, all. I’ve been working on a high-fidelity implementation of AlphaStar’s league algorithm within RLlib, and I think I’ve gotten pretty close: I’ve reconciled the differences between the paper and the released pseudocode, and accounted for the eccentricities of applying the algorithm to my target environment, which has a high enough draw rate that I had to change the PFSP implementation a bit to compensate. However, I’ve hit a roadblock on the infrastructure side:
Notably, AlphaStar differentiates the ‘student’ agent from the ‘teacher’ agent when assigning matches. The main exploiter targets the main agent the majority of the time, but the results of those matches are used only to optimize the exploiter; `main` itself is updated only when it specifically seeks out a challenging exploiter. This ensures that poorly performing exploiters do not constitute a disproportionate share of the main agent’s training data. Without this measure, training collapses, as seen in my current results (shown below).[1]
As the exploiter falls behind with no means of catching up, `main` is increasingly incentivized to pursue suboptimal strategies that exploit the exploiter’s weaknesses but leave `main` vulnerable to stronger opponents.
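For concreteness, the data flow I’m trying to reproduce is roughly the per-learner match loop from the published pseudocode: every match is generated for exactly one learner, and only that learner’s side of the trajectory ever reaches an optimizer. A toy sketch (all names here are placeholders I made up, not RLlib or DeepMind API):

```python
import random

LEARNERS = ["main", "main_exploiter", "league_exploiter"]

def league_iteration(get_match, run_episode, update):
    # Pick which learner this match is 'for', then ask it for an opponent.
    learner = random.choice(LEARNERS)
    opponent = get_match(learner)
    # Play the match; keep both halves of the trajectory.
    learner_traj, _opponent_traj = run_episode(learner, opponent)
    # Only the learner is optimized. The opponent acts with frozen weights and
    # is updated solely from matches generated for *it* on other iterations.
    update(learner, learner_traj)
```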
Ideally, I would like to be able to specify in my `agent_to_module_mapping_fn` that `agent_to_train` should have its weights updated after the episode, while its opponent should not be updated, even when that opponent is included in `policies_to_train`. I expect my solution will have to be a little hacky given how unconventional the ask is, but I feel like there must be a preferred way to do this that is more elegant than the alternatives. Any advice would be greatly appreciated. My current mapping function:
```python
import numpy as np

# `agent_names`, `wr` (pairwise win-rate table), `just_added`, and `pfsp` are
# defined elsewhere in my league code.

def atm_fn(agent_id, episode, **kwargs):
    # Seed the rng from the episode id so both agents in the episode agree.
    eid = hash(episode.id_)
    rng = np.random.default_rng(seed=abs(eid))
    r1 = rng.random()
    # The learning agent this episode is 'for', distributed evenly b/t agents.
    agent_to_train = (
        "main" if r1 < 1 / 3
        else "main_exploiter" if r1 < 2 / 3
        else "league_exploiter"
    )
    if (eid % 2 == 0) != (agent_id == 1):
        return agent_to_train
    # Select an opponent.
    if agent_to_train == "main":  # opponents for main
        rand = rng.random()
        if rand < .35:  # 35% self-play
            return "main"
        elif rand < .85:  # 50% PFSP (any other agent)
            valid_options = filter(lambda s: s != "main", agent_names)
        else:  # 15% any agent with > 70% WR against main, or SP if none
            valid_options = list(filter(lambda s: wr[s]["main"] > 0.7, agent_names))
            if len(valid_options) == 0:
                return "main"
    elif agent_to_train == "main_exploiter":  # opponents for ME
        # 'Doing well' = winning >~10% of decisive games vs. main:
        # w > l / 9  <=>  w / (w + l) > 0.1, which tolerates draws.
        wr_thresh_me = wr["main"]["main_exploiter"] / 9
        if wr["main_exploiter"]["main"] > wr_thresh_me and rng.random() > .5:
            return "main"  # 50% play versus main, if it's doing well
        # Otherwise PFSP against main's past copies.
        valid_options = filter(lambda s: s[:6] == "main_v", agent_names)
    else:  # opponents for LE (all past players; fig 1)
        valid_options = filter(lambda s: "_v" in s, agent_names)
    # Run PFSP on our options.
    valid_options = filter(lambda s: s not in just_added, valid_options)
    return pfsp(agent_to_train, list(valid_options), wr, rng)
```
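For reference, the `pfsp` helper called at the end is essentially the prioritized-fictitious-self-play sampler from the AlphaStar pseudocode: weight each candidate opponent by how badly the learner does against it, then sample. A simplified sketch is below; the default `weighting` and the uniform fallback are just how I’ve written this sketch, and it leaves out the draw-rate adjustment I mentioned above:

```python
import numpy as np

def pfsp(learner, candidates, wr, rng, weighting=lambda p: (1.0 - p) ** 2):
    # wr[learner][opp] is the learner's win rate against that opponent.
    # The AlphaStar pseudocode uses different weightings in different places;
    # f(p) = (1 - p)**2 ('squared') is the hardest-opponents variant.
    # Assumes `candidates` is non-empty.
    weights = np.array([weighting(wr[learner][opp]) for opp in candidates])
    total = weights.sum()
    if total <= 0.0:
        # Learner beats every candidate outright -> fall back to uniform.
        return rng.choice(candidates)
    return rng.choice(candidates, p=weights / total)
```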
[1] There is a remedial mechanism for main exploiters that fall behind, but even with this implemented, exploiters require experience against `main` to become useful, and the process of getting this experience is detrimental to `main`’s learning.

