Summary
In a self-play setup on RLlib’s older stack (driver + rollout workers, no Learner API), I add new opponent policies during training (league snapshots). On the driver I create the policy via `algo.add_policy(...)` and copy the weights from `"main"`. On the remote rollout workers, though, that policy shows up with freshly initialized weights, not the copied ones.
Key symptoms
- With `num_workers = 0` everything behaves correctly: newly added policies have the copied weights and produce the expected action distributions.
- With `num_workers > 0`, those same policies on the rollout workers act as if they were still at initialization.
- How I detected it: I log the per-step `action_dist` (softmax over logits), roughly as in the callback sketch below this list. For any newly added snapshot policy, the probabilities match the iteration-0 distribution, while `"main"` clearly progresses during training.
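Roughly how that logging works (simplified sketch for a discrete action space; the callback here is a minimal stand-in for my actual logging code, and the `DefaultCallbacks` import path depends on the Ray version):

```python
import numpy as np
from ray.rllib.algorithms.callbacks import DefaultCallbacks

class ActionDistLogger(DefaultCallbacks):
    # Softmax the dist inputs (logits) that the worker's policy actually
    # produced while sampling, and log them per policy.
    def on_postprocess_trajectory(
        self, *, worker, episode, agent_id, policy_id, policies,
        postprocessed_batch, original_batches, **kwargs
    ):
        logits = postprocessed_batch["action_dist_inputs"]
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        print(policy_id, probs.mean(axis=0))  # compare against iteration-0 values
```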
Environment / context
- RLlib: classic WorkerSet/RolloutWorker API (no RLModule/LearnerGroup).
- Algo: PPO (Torch) with a custom loss.
- Multi-agent self-play. Opponents are picked by a `policy_mapping_fn` from a league (see the sketch after this list).
- I create snapshot policies at runtime and insert them into the league.
What I do (simplified)
```python
def copy_weights(algo, dst_policy_name: str):
    src = algo.get_policy("main")

    # Create the policy on the driver.
    algo.add_policy(
        dst_policy_name,
        type(src),
        observation_space=src.observation_space,
        action_space=src.action_space,
        config=getattr(src, "config", {}).copy(),
    )

    # Copy weights on the driver.
    algo.get_policy(dst_policy_name).set_weights(src.get_weights())
```
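For context, snapshots are triggered from a train-result callback, roughly like this (simplified sketch: `LeagueCallbacks` and the snapshot interval are illustrative, and the exact `on_train_result` signature depends on the Ray version):

```python
from ray.rllib.algorithms.callbacks import DefaultCallbacks

class LeagueCallbacks(DefaultCallbacks):
    # Simplified snapshot trigger; the real league logic is more involved.
    def on_train_result(self, *, algorithm, result, **kwargs):
        it = result["training_iteration"]
        if it % 20 == 0:  # hypothetical snapshot interval
            copy_weights(algorithm, f"main_v{it}")
```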
Then I include `dst_policy_name` in the opponent pool so the mapping fn can select it.

Expected behavior
After adding a policy and copying `"main"`’s weights, all rollout/eval workers should hold an identical replica, so samples reflect the cloned parameters.

Actual behavior
- On the driver the new policy has the correct (copied) weights.
- On remote workers the new policy either gets created lazily with default init or doesn’t receive the copied weights in time. Episodes sampled against it behave like a fresh init.
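One way to see the divergence directly is to fingerprint the weights on every worker, e.g. (sketch, assuming the legacy `WorkerSet.foreach_worker` API):

```python
import numpy as np

def weight_fingerprint(algo, policy_id):
    # One scalar per worker (local worker first), so the driver copy and the
    # remote copies are easy to compare at a glance.
    def fp(worker):
        if policy_id not in worker.policy_map:
            return None
        weights = worker.get_policy(policy_id).get_weights()
        return float(sum(np.abs(w).sum() for w in weights.values()))
    return algo.workers.foreach_worker(fp)
```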
If there’s a better/official pattern for dynamic policy addition + weight broadcast in the legacy workers stack, I’d really appreciate guidance.
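Concretely: is an explicit, manual broadcast like the one below supposed to be part of the snapshot step, or should `add_policy` handle it? (Sketch only; I’m not sure `sync_weights` is meant to be called like this, and its kwargs may differ across Ray versions.)

```python
# After the driver-side copy in copy_weights(), push the new policy's weights
# from the local worker out to all remote rollout workers:
algo.workers.sync_weights(policies=[dst_policy_name])
```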