Summary
In a self-play setup on RLlib’s older stack (driver + rollout workers, no Learner API), I add new opponent policies during training (league snapshots). On the driver I create the policy via `algo.add_policy(...)` and copy the weights from `"main"`. On the remote rollout workers, though, that policy shows up with freshly initialized weights, not the copied ones.
Key symptoms
- With `num_workers = 0` everything behaves correctly: newly added policies have the copied weights and produce the expected action distributions.
- With `num_workers > 0`, those same policies on the rollout workers act as if they were still at initialization.
- How I detected it: I log the per-step `action_dist` (softmax over logits), roughly as in the callback sketch below this list. For any newly added snapshot policy, the probabilities match the iteration-0 distribution, while `"main"` clearly progresses during training.
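Roughly how that logging works (simplified sketch for a discrete action space; the callback here is a minimal stand-in for my actual logging code, and the `DefaultCallbacks` import path depends on the Ray version):

```python
import numpy as np
from ray.rllib.algorithms.callbacks import DefaultCallbacks

class ActionDistLogger(DefaultCallbacks):
    # Softmax the dist inputs (logits) that the worker's policy actually
    # produced while sampling, and log them per policy.
    def on_postprocess_trajectory(
        self, *, worker, episode, agent_id, policy_id, policies,
        postprocessed_batch, original_batches, **kwargs
    ):
        logits = postprocessed_batch["action_dist_inputs"]
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        print(policy_id, probs.mean(axis=0))  # compare against iteration-0 values
```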
Environment / context
- RLlib: classic WorkerSet/RolloutWorker API (no RLModule/LearnerGroup).
- Algo: PPO (Torch) with a custom loss.
- Multi-agent self-play. Opponents are picked by a `policy_mapping_fn` from a league (see the sketch after this list).
- I create snapshot policies at runtime and insert them into the league.
What I do (simplified)
```python
def copy_weights(algo, dst_policy_name: str):
    src = algo.get_policy("main")

    # Create the policy on the driver.
    algo.add_policy(
        dst_policy_name,
        type(src),
        observation_space=src.observation_space,
        action_space=src.action_space,
        config=getattr(src, "config", {}).copy(),
    )

    # Copy weights on the driver.
    algo.get_policy(dst_policy_name).set_weights(src.get_weights())
```
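For context, snapshots are triggered from a train-result callback, roughly like this (simplified sketch: `LeagueCallbacks` and the snapshot interval are illustrative, and the exact `on_train_result` signature depends on the Ray version):

```python
from ray.rllib.algorithms.callbacks import DefaultCallbacks

class LeagueCallbacks(DefaultCallbacks):
    # Simplified snapshot trigger; the real league logic is more involved.
    def on_train_result(self, *, algorithm, result, **kwargs):
        it = result["training_iteration"]
        if it % 20 == 0:  # hypothetical snapshot interval
            copy_weights(algorithm, f"main_v{it}")
```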
Then I include `dst_policy_name` in the opponent pool so the mapping fn can select it.

Expected behavior
After adding a policy and copying `"main"`’s weights, all rollout/eval workers should hold an identical replica, so samples reflect the cloned parameters.

Actual behavior
- On the driver the new policy has the correct (copied) weights.
- On remote workers the new policy either gets created lazily with default init or doesn’t receive the copied weights in time. Episodes sampled against it behave like a fresh init.
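One way to see the divergence directly is to fingerprint the weights on every worker, e.g. (sketch, assuming the legacy `WorkerSet.foreach_worker` API):

```python
import numpy as np

def weight_fingerprint(algo, policy_id):
    # One scalar per worker (local worker first), so the driver copy and the
    # remote copies are easy to compare at a glance.
    def fp(worker):
        if policy_id not in worker.policy_map:
            return None
        weights = worker.get_policy(policy_id).get_weights()
        return float(sum(np.abs(w).sum() for w in weights.values()))
    return algo.workers.foreach_worker(fp)
```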
If there’s a better/official pattern for dynamic policy addition + weight broadcast in the legacy workers stack, I’d really appreciate guidance.
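Concretely: is an explicit, manual broadcast like the one below supposed to be part of the snapshot step, or should `add_policy` handle it? (Sketch only; I’m not sure `sync_weights` is meant to be called like this, and its kwargs may differ across Ray versions.)

```python
# After the driver-side copy in copy_weights(), push the new policy's weights
# from the local worker out to all remote rollout workers:
algo.workers.sync_weights(policies=[dst_policy_name])
```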