RLlib (classic WorkerSet API): How to atomically add a new policy and push its weights to all rollout/eval workers? Snapshot policies stay at init on workers

Summary
In a self-play setup on RLlib’s older stack (driver + rollout workers, no Learner API), I add new opponent policies during training (league snapshots). On the driver I create the policy via algo.add_policy(...) and copy weights from "main". On remote rollout workers, though, that policy appears with initial weights (fresh init), not with the copied weights.

Key symptoms

  • With num_workers = 0 everything behaves correctly: newly added policies have the copied weights and produce the expected action distributions.

  • With num_workers > 0, those same policies on rollout workers act as if they’re still at initialization.

  • How I detected it: I log the per-step action distribution (softmax over the policy logits; logging sketched below). For any newly added snapshot policy, the probabilities match the iteration-0 distribution, while "main" clearly progresses during training.
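
Roughly how the per-step distributions get logged (a simplified sketch; the callback class name and metric key are mine, while the callback hook and the "action_dist_inputs" batch column are standard RLlib):

    import numpy as np

    from ray.rllib.algorithms.callbacks import DefaultCallbacks
    from ray.rllib.policy.sample_batch import SampleBatch


    class ActionDistLogger(DefaultCallbacks):
        def on_postprocess_trajectory(
            self, *, worker, episode, agent_id, policy_id, policies,
            postprocessed_batch, original_batches, **kwargs
        ):
            # Per-step logits the policy produced for this agent's trajectory.
            logits = postprocessed_batch[SampleBatch.ACTION_DIST_INPUTS]
            # Softmax -> per-step action probabilities (discrete actions assumed).
            probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
            probs /= probs.sum(axis=-1, keepdims=True)
            episode.custom_metrics[f"{policy_id}/mean_max_prob"] = float(
                probs.max(axis=-1).mean()
            )

The class is registered via the config (e.g. .callbacks(ActionDistLogger)); for the runtime-added snapshot policies, the logged distribution sits at the iteration-0 one.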

Environment / context

  • RLlib: classic WorkerSet/RolloutWorker API (no RLModule/LearnerGroup).

  • Algo: PPO (Torch) with custom loss.

  • Multi-agent self-play. Opponents are picked by a policy_mapping_fn from a league (setup sketched after this list).

  • I create snapshot policies at runtime and insert them into the league.
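
For context, the multi-agent part of the config looks roughly like this (simplified; the opponent_pool list, the agent-ID convention, and the worker count are illustrative):

    import random

    from ray.rllib.algorithms.ppo import PPOConfig

    # League of opponent policy IDs; snapshot IDs get appended at runtime.
    opponent_pool = ["main_v0"]

    def policy_mapping_fn(agent_id, episode, worker, **kwargs):
        # One agent always plays the learning policy, the other a league member.
        if agent_id == "agent_0":
            return "main"
        return random.choice(opponent_pool)

    config = (
        PPOConfig()
        .multi_agent(
            policies={"main", "main_v0"},   # spaces inferred from the env
            policy_mapping_fn=policy_mapping_fn,
            policies_to_train=["main"],
        )
        .rollouts(num_rollout_workers=4)
    )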

What I do (simplified)

    def copy_weights(algo, dst_policy_name: str):
        """Snapshot "main": create a new policy and copy "main"'s weights into it."""
        src = algo.get_policy("main")

        # Create the new policy on the driver (same class, spaces, and config as "main").
        algo.add_policy(
            dst_policy_name,
            type(src),
            observation_space=src.observation_space,
            action_space=src.action_space,
            config=getattr(src, "config", {}).copy(),
        )

        # Copy the weights into the driver's local copy of the new policy.
        algo.get_policy(dst_policy_name).set_weights(src.get_weights())

Then I include dst_policy_name in the opponent pool so the mapping fn can select it.
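
And roughly how this gets invoked during training (simplified; the cadence and the ID scheme are placeholders):

    snapshot_every = 25  # placeholder cadence

    for it in range(1, 1001):
        algo.train()
        if it % snapshot_every == 0:
            snapshot_id = f"main_v{it}"
            copy_weights(algo, snapshot_id)
            opponent_pool.append(snapshot_id)  # mapping fn can now select it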

Expected behavior
After adding a policy and copying "main"’s weights, all rollout/eval workers should hold an identical replica, so samples reflect the cloned parameters.

Actual behavior

  • On the driver the new policy has the correct (copied) weights.

  • On remote workers the new policy is either created lazily with default init or never receives the copied weights in time; episodes sampled against it behave like a fresh init.
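
For completeness, this is roughly how I compare the driver's copy with the workers' copies (a weight-fingerprint check via the legacy WorkerSet; the helper name and policy ID are mine):

    import numpy as np

    def weight_fingerprint(worker, policy_id):
        # Sum of absolute parameter values as a cheap fingerprint of the weights.
        weights = worker.get_policy(policy_id).get_weights()
        return float(sum(np.abs(w).sum() for w in weights.values()))

    pid = "main_v25"  # one of the runtime-added snapshot policies
    fingerprints = algo.workers.foreach_worker(lambda w: weight_fingerprint(w, pid))
    # First entry is the driver's local worker, the rest are the remote rollout workers;
    # the remote values match a freshly initialized policy, not the driver's copy.
    print(fingerprints)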

If there’s a better/official pattern for dynamic policy addition + weight broadcast in the legacy WorkerSet stack, I’d really appreciate guidance.