Actor weights don't seem to propagate from the Learner to the EnvRunner

1. Severity of the issue: (select one)

  • None: I’m just curious or want clarification.
  • Low: Annoying but doesn’t hinder my work.
  • Medium: Significantly affects my productivity but can find a workaround.
  • High: Completely blocks me.

2. Environment:

  • Ray version: 2.51.1
  • Python version: 3.12.12
  • OS: Ubuntu

3. What happened vs. what you expected:

  • Expected: After an update on the learner, the updated weights should propagate back to the EnvRunner for the next rollout.
  • Actual: The weights on the Learner continually update, but the weights on the EnvRunner appear more-or-less unchanged.

Example of Error:

Config:

# @title config_test
# TwoStepStochasticBandits, LogActionsCallback, ActionMaskingTorchRLModule,
# and DEFAULT_POLICY_ID are assumed to be defined or imported elsewhere in
# the notebook.
import functools

import numpy as np
import torch

from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.core.rl_module.rl_module import RLModuleSpec
from ray.rllib.utils.metrics import ENV_RUNNER_RESULTS

target_env = TwoStepStochasticBandits
single_agent_env = target_env()
doors = [(0.5, 0.7), (0.4, 0.1), (0.6, 1)]

config = (
    PPOConfig()
    .environment(target_env, env_config={
        'doors': doors
    })
    .env_runners(
        num_env_runners=0,
        num_envs_per_env_runner=1,
    )
    .callbacks(
        functools.partial(LogActionsCallback, action_lambda=lambda: range(1,1+len(doors)))
    )
    .training(
        lr=1e-4,
        minibatch_size=2048,
    )
    .rl_module(
        rl_module_spec=RLModuleSpec(
            module_class=ActionMaskingTorchRLModule,
            model_config={
                "head_fcnet_hiddens": (32,),
            }
        )
    )
)

algo = config.build_algo()

Training Loop:

# @title Train
num_iters = 10
vf_losses = []
main_returns = []

for i in range(num_iters):
  results = algo.train()
  if ENV_RUNNER_RESULTS in results:
      mean_return = results[ENV_RUNNER_RESULTS].get(
          'episode_return_mean', np.nan
      )
      vf_loss = results['learners'][DEFAULT_POLICY_ID]['vf_loss']
      vf_losses.append(vf_loss)
      main_returns.append(mean_return)
      # What actions are we getting? (custom counters logged by LogActionsCallback)
      actions = np.array(
          [results[ENV_RUNNER_RESULTS][f'player_action_{j}'] for j in range(1, 4)]
      )
      actions = actions / actions.sum()
      actions = ', '.join([f'{a:.2f}' for a in actions])
      print(f"iter={i+1} VF loss={vf_loss:.2f} R={mean_return:.2f}\naction distr=[{actions}]")

      # Demo of the bug: run the same starting-state batch through the EnvRunner's
      # module and the Learner's module and compare the resulting action logits.
      test = {
          'obs': {
              'observations': torch.tensor([[1., 0., 0., 0.]]),
              'action_mask': torch.tensor([[0., 1., 1., 1.]])
          },
      }

      # Logits for the three valid actions on the (local) EnvRunner.
      er_outputs = algo.env_runner_group._local_env_runner.module.forward_exploration(
          test
      )['action_dist_inputs'][0][1:]

      # Re-create the test batch (the masking module may modify it in place).
      test = {
          'obs': {
              'observations': torch.tensor([[1., 0., 0., 0.]]),
              'action_mask': torch.tensor([[0., 1., 1., 1.]])
          },
      }

      # Logits for the same batch on the Learner.
      learner_outputs = algo.learner_group._learner._module[
          DEFAULT_POLICY_ID
      ].forward_exploration(test)['action_dist_inputs'][0][1:]

      print(er_outputs)
      print(learner_outputs)

Output:

iter=1 VF loss=0.49 R=0.46
action distr=[0.34, 0.32, 0.34]
tensor([0.1735, 0.0714, 0.1752])    # Action logits from starting state on EnvRunner, after first update
tensor([ 0.2722, -0.5590,  0.4195]) # Action logits from starting state on Learner, after first update
iter=2 VF loss=0.48 R=0.46
action distr=[0.33, 0.31, 0.36]
tensor([0.1723, 0.0744, 0.1758])
tensor([ 0.0989, -0.3425,  0.3990])
iter=3 VF loss=0.57 R=0.56
action distr=[0.31, 0.27, 0.42]
tensor([0.1726, 0.0744, 0.1759])
tensor([ 0.1174, -0.3478,  0.3808])
iter=4 VF loss=0.48 R=0.54
action distr=[0.38, 0.31, 0.31]
tensor([0.1725, 0.0746, 0.1760])
tensor([ 0.1015, -0.3280,  0.3788])
iter=5 VF loss=0.53 R=0.52
action distr=[0.42, 0.27, 0.31]
tensor([0.1727, 0.0741, 0.1761])
tensor([ 0.1313, -0.3722,  0.3960])
iter=6 VF loss=0.48 R=0.51
action distr=[0.29, 0.34, 0.37]
tensor([0.1735, 0.0746, 0.1761])
tensor([ 0.1757, -0.3607,  0.3191])
iter=7 VF loss=0.51 R=0.49
action distr=[0.36, 0.28, 0.36]
tensor([0.1725, 0.0752, 0.1769])
tensor([ 0.0546, -0.2765,  0.3851])
iter=8 VF loss=0.49 R=0.55
action distr=[0.38, 0.32, 0.30]
tensor([0.1732, 0.0744, 0.1766])
tensor([ 0.1679, -0.4032,  0.3970])
iter=9 VF loss=0.50 R=0.47
action distr=[0.31, 0.27, 0.43]
tensor([0.1726, 0.0756, 0.1769])
tensor([ 0.0727, -0.2964,  0.3902])
iter=10 VF loss=0.54 R=0.45
action distr=[0.32, 0.37, 0.32]
tensor([0.1729, 0.0752, 0.1770])   # Action logits from starting state on EnvRunner, after last update (virtually unchanged!)
tensor([ 0.1059, -0.3327,  0.3948])   # Action logits from starting state on Learner, after last update (very different!)
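
The same divergence can presumably also be seen on the raw parameters rather than through forward passes. A minimal sketch, assuming both modules are plain torch.nn.Modules (key prefixes may differ if the EnvRunner wraps its module in a MultiRLModule):

# Sketch: diff the EnvRunner's and the Learner's parameters directly,
# using the same access paths as in the training loop above.
er_sd = algo.env_runner_group._local_env_runner.module.state_dict()
learner_sd = algo.learner_group._learner._module[DEFAULT_POLICY_ID].state_dict()

print("Keys only on the EnvRunner:", sorted(set(er_sd) - set(learner_sd)))
print("Keys only on the Learner:  ", sorted(set(learner_sd) - set(er_sd)))
for k in sorted(set(er_sd) & set(learner_sd)):
    a = er_sd[k].detach().cpu().float()
    b = learner_sd[k].detach().cpu().float()
    if a.shape != b.shape or not torch.allclose(a, b):
        print("Parameter differs between EnvRunner and Learner:", k)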

The forum’s automated LLM (RunLLM) responded:

This is a known issue: in some cases, weights updated on the Learner are not properly propagated to the EnvRunner, causing the EnvRunner to use stale weights for rollouts. This can happen if the weight synchronization step is not triggered or completed after each learner update. The expected behavior is that after each training iteration, the EnvRunner’s weights should match the Learner’s, but your logs show they remain unchanged on the EnvRunner while the Learner’s weights evolve. This issue has been reported by other users as well, especially when using custom modules or certain configurations with the new API stack.

To ensure weights are synchronized, you may need to explicitly call the synchronization method after each update. In RLlib’s new API stack, this is typically handled by env_runner_group.sync_weights(from_worker_or_learner_group=learner_group, inference_only=True) after the learner update, as shown in the PPO and DreamerV3 training loops (see example). If this is already present and the issue persists, it may be a bug in the synchronization logic or a race condition. You can also check for related issues and workarounds in the Ray GitHub issues and discussions.

Would you like a step-by-step breakdown of how to debug or force weight synchronization in your setup?

Sources:

Hint: Mention @RunLLM in the post for followups.

The LLM’s response doesn’t really help: it (correctly) points out that the weights should sync, as is done on line 454 of ppo.py, but it doesn’t explain why they don’t. It also claims that others have reported this issue before, yet the issue it links as an example (which I wrote!) is only tangentially related and doesn’t describe this problem.
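
As a sanity check, the synchronization the response refers to can also be forced by hand and the logits re-compared. A minimal sketch, reusing the algo object and the test batch from above; the argument names are taken from the quoted response, so treat the exact signature as an assumption:

# Sketch: explicitly push the Learner's weights to the EnvRunner(s), then
# re-run the EnvRunner-side forward pass from the training loop.
algo.env_runner_group.sync_weights(
    from_worker_or_learner_group=algo.learner_group,
    inference_only=True,
)

test = {
    'obs': {
        'observations': torch.tensor([[1., 0., 0., 0.]]),
        'action_mask': torch.tensor([[0., 1., 1., 1.]])
    },
}
print(
    algo.env_runner_group._local_env_runner.module.forward_exploration(
        test
    )['action_dist_inputs'][0][1:]
)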

Update: I’ve figured out the cause of this bug. The automatic configuration RLlib performs to handle inference_only behavior creates a mismatch between the Learner’s and the EnvRunner’s weight dictionaries. Setting 'vf_share_layers': True makes the issue go away (but it also means the encoder is shared between the policy and the value function, which may not be desirable).
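
For anyone hitting the same thing, the workaround amounts to one extra flag in the model_config of the config above. A sketch (same RLModuleSpec as before, with only 'vf_share_layers' added):

config = config.rl_module(
    rl_module_spec=RLModuleSpec(
        module_class=ActionMaskingTorchRLModule,
        model_config={
            "head_fcnet_hiddens": (32,),
            # Workaround: share the encoder between the policy and the value
            # function so the inference_only weight dict on the EnvRunner
            # lines up with the Learner's.
            "vf_share_layers": True,
        },
    )
)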