1. Severity of the issue: (select one)
- None: I’m just curious or want clarification.
- Low: Annoying but doesn’t hinder my work.
- Medium: Significantly affects my productivity but can find a workaround.
- High: Completely blocks me.
2. Environment:
- Ray version: 2.51.1
- Python version: 3.12.12
- OS: Ubuntu
3. What happened vs. what you expected:
- Expected: After each update on the Learner, the updated weights should propagate back to the EnvRunner for the next rollout.
- Actual: The weights on the Learner update every iteration, but the weights on the EnvRunner remain essentially unchanged across iterations (see the quick check sketched below).
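
A direct way to see the mismatch is to diff the raw parameters of the two module copies. The sketch below reuses the same private attributes as the repro further down (not stable public API), and the `DEFAULT_POLICY_ID` import path is my assumption; both copies are `torch.nn.Module`s, so their `state_dict()`s can be compared key by key:

```python
from ray.rllib.policy.sample_batch import DEFAULT_POLICY_ID

# Same private attributes as in the repro below -- purely for inspection.
er_module = algo.env_runner_group._local_env_runner.module
learner_module = algo.learner_group._learner._module[DEFAULT_POLICY_ID]

er_state = {k: v.detach().cpu() for k, v in er_module.state_dict().items()}
ln_state = {k: v.detach().cpu() for k, v in learner_module.state_dict().items()}

# After algo.train() these should be (near-)identical if the weights propagated.
for name in sorted(set(er_state) & set(ln_state)):
    max_diff = (er_state[name] - ln_state[name]).abs().max().item()
    print(f"{name}: max abs diff = {max_diff:.6f}")
```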
Example of the error:

Config:

```python
# @title config_test
import functools

from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.core.rl_module.rl_module import RLModuleSpec

# TwoStepStochasticBandits, LogActionsCallback, and ActionMaskingTorchRLModule
# are defined/imported earlier in the notebook (not shown here).

target_env = TwoStepStochasticBandits
single_agent_env = target_env()

# One tuple per door; the env reads these from env_config.
doors = [(0.5, 0.7), (0.4, 0.1), (0.6, 1)]

config = (
    PPOConfig()
    .environment(target_env, env_config={
        'doors': doors,
    })
    .env_runners(
        num_env_runners=0,
        num_envs_per_env_runner=1,
    )
    .callbacks(
        functools.partial(
            LogActionsCallback,
            action_lambda=lambda: range(1, 1 + len(doors)),
        )
    )
    .training(
        lr=1e-4,
        minibatch_size=2048,
    )
    .rl_module(
        rl_module_spec=RLModuleSpec(
            module_class=ActionMaskingTorchRLModule,
            model_config={
                "head_fcnet_hiddens": (32,),
            },
        )
    )
)

algo = config.build_algo()
```
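
Because `TwoStepStochasticBandits` and `LogActionsCallback` are my own classes, here is a minimal, self-contained stand-in environment with the same observation interface (a `Dict` of `"observations"` and `"action_mask"`, four discrete actions) for anyone who wants to run the config above. The class name `TwoStepBanditsStandIn` and all of its dynamics are placeholders, not my actual environment:

```python
import gymnasium as gym
import numpy as np


class TwoStepBanditsStandIn(gym.Env):
    """Minimal stand-in for TwoStepStochasticBandits (placeholder dynamics only).

    Observations are a Dict of a length-4 one-hot state vector ("observations")
    and a length-4 "action_mask", matching the tensors used in the bug demo
    ([1, 0, 0, 0] with mask [0, 1, 1, 1] in the start state).
    """

    def __init__(self, config=None):
        config = config or {}
        # One (p, reward) tuple per door; the semantics here are invented.
        self.doors = config.get("doors", [(0.5, 0.7), (0.4, 0.1), (0.6, 1)])
        self.action_space = gym.spaces.Discrete(4)
        self.observation_space = gym.spaces.Dict({
            "observations": gym.spaces.Box(0.0, 1.0, (4,), np.float32),
            "action_mask": gym.spaces.Box(0.0, 1.0, (4,), np.float32),
        })
        self._state = 0

    def _obs(self):
        obs = np.zeros(4, dtype=np.float32)
        obs[self._state] = 1.0
        if self._state == 0:
            # Start state: only the three "door" actions (1..3) are valid.
            mask = np.array([0.0, 1.0, 1.0, 1.0], dtype=np.float32)
        else:
            mask = np.ones(4, dtype=np.float32)
        return {"observations": obs, "action_mask": mask}

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self._state = 0
        return self._obs(), {}

    def step(self, action):
        if self._state == 0:
            # First step: move to the chosen door's state; no reward yet.
            self._state = max(1, int(action))
            return self._obs(), 0.0, False, False, {}
        # Second step: pay the chosen door's reward with its probability, then end.
        p, r = self.doors[(self._state - 1) % len(self.doors)]
        reward = float(r) if self.np_random.random() < p else 0.0
        return self._obs(), reward, True, False, {}
```

With this stand-in, set `target_env = TwoStepBanditsStandIn` and drop the `.callbacks(...)` line plus the action-distribution printout in the training loop, since those depend on my custom callback.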
Training Loop:

```python
# @title Train
import numpy as np
import torch

from ray.rllib.policy.sample_batch import DEFAULT_POLICY_ID
from ray.rllib.utils.metrics import ENV_RUNNER_RESULTS

num_iters = 10
vf_losses = []
main_returns = []

for i in range(num_iters):
    results = algo.train()

    if ENV_RUNNER_RESULTS in results:
        mean_return = results[ENV_RUNNER_RESULTS].get(
            'episode_return_mean', np.nan
        )
        vf_loss = results['learners'][DEFAULT_POLICY_ID]['vf_loss']
        vf_losses.append(vf_loss)
        main_returns.append(mean_return)

        # What actions are we getting? (player_action_{a} is logged by LogActionsCallback.)
        actions = np.array(
            [results[ENV_RUNNER_RESULTS][f'player_action_{a}'] for a in range(1, 1 + len(doors))]
        )
        actions = actions / actions.sum()
        actions = ', '.join([f'{a:.2f}' for a in actions])

        print(f"iter={i+1} VF loss={vf_loss:.2f} R={mean_return:.2f}\naction distr=[{actions}]")

    # Demo of the bug: query the EnvRunner's and the Learner's copy of the RLModule
    # with the same starting-state observation and compare the action logits.
    # (Reaches into private attributes purely for inspection.)
    test = {
        'obs': {
            'observations': torch.tensor([[1., 0., 0., 0.]]),
            'action_mask': torch.tensor([[0., 1., 1., 1.]]),
        },
    }
    er_outputs = algo.env_runner_group._local_env_runner.module.forward_exploration(
        test
    )['action_dist_inputs'][0][1:]
    learner_outputs = algo.learner_group._learner._module[DEFAULT_POLICY_ID].forward_exploration(
        test
    )['action_dist_inputs'][0][1:]
    print(er_outputs)
    print(learner_outputs)
```
Output:

```
iter=1 VF loss=0.49 R=0.46
action distr=[0.34, 0.32, 0.34]
tensor([0.1735, 0.0714, 0.1752]) # Action logits from starting state on EnvRunner, after first update
tensor([ 0.2722, -0.5590, 0.4195]) # Action logits from starting state on Learner, after first update
iter=2 VF loss=0.48 R=0.46
action distr=[0.33, 0.31, 0.36]
tensor([0.1723, 0.0744, 0.1758])
tensor([ 0.0989, -0.3425, 0.3990])
iter=3 VF loss=0.57 R=0.56
action distr=[0.31, 0.27, 0.42]
tensor([0.1726, 0.0744, 0.1759])
tensor([ 0.1174, -0.3478, 0.3808])
iter=4 VF loss=0.48 R=0.54
action distr=[0.38, 0.31, 0.31]
tensor([0.1725, 0.0746, 0.1760])
tensor([ 0.1015, -0.3280, 0.3788])
iter=5 VF loss=0.53 R=0.52
action distr=[0.42, 0.27, 0.31]
tensor([0.1727, 0.0741, 0.1761])
tensor([ 0.1313, -0.3722, 0.3960])
iter=6 VF loss=0.48 R=0.51
action distr=[0.29, 0.34, 0.37]
tensor([0.1735, 0.0746, 0.1761])
tensor([ 0.1757, -0.3607, 0.3191])
iter=7 VF loss=0.51 R=0.49
action distr=[0.36, 0.28, 0.36]
tensor([0.1725, 0.0752, 0.1769])
tensor([ 0.0546, -0.2765, 0.3851])
iter=8 VF loss=0.49 R=0.55
action distr=[0.38, 0.32, 0.30]
tensor([0.1732, 0.0744, 0.1766])
tensor([ 0.1679, -0.4032, 0.3970])
iter=9 VF loss=0.50 R=0.47
action distr=[0.31, 0.27, 0.43]
tensor([0.1726, 0.0756, 0.1769])
tensor([ 0.0727, -0.2964, 0.3902])
iter=10 VF loss=0.54 R=0.45
action distr=[0.32, 0.37, 0.32]
tensor([0.1729, 0.0752, 0.1770]) # Action logits from starting state on EnvRunner, after last update (virtually unchanged!)
tensor([ 0.1059, -0.3327, 0.3948]) # Action logits from starting state on Learner, after last update (very different!)
```
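
If this is indeed a missing weight sync, manually copying the weights after every `algo.train()` call should work around it; below is a sketch of what I mean. The `get_state()`/`set_state()` round-trip between the two module copies and the `sync_weights(...)` signature mentioned in the trailing comment are assumptions on my part, not verified on 2.51.1.

```python
# Workaround sketch. Assumptions: RLModule.get_state()/set_state() round-trip
# between the Learner's and the EnvRunner's module copies, and the private
# attribute paths below (the same ones used in the repro) still point at the
# live modules.
from ray.rllib.policy.sample_batch import DEFAULT_POLICY_ID

learner_module = algo.learner_group._learner._module[DEFAULT_POLICY_ID]
er_module = algo.env_runner_group._local_env_runner.module

for _ in range(num_iters):
    algo.train()
    # Manually push the freshly updated Learner weights to the local EnvRunner.
    er_module.set_state(learner_module.get_state())

# A public-API alternative might be something like:
#   algo.env_runner_group.sync_weights(from_worker_or_learner_group=algo.learner_group)
# but I have not verified that signature on 2.51.1.
```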