1. Severity of the issue: (select one)
- None: I’m just curious or want clarification.
- Low: Annoying but doesn’t hinder my work.
- Medium: Significantly affects my productivity but can find a workaround.
- High: Completely blocks me.
2. Environment:
- Ray version: 2.51.1
- Python version: 3.12.12
- OS: Ubuntu
3. What happened vs. what you expected:
- Expected: After each update on the Learner, the updated weights should propagate back to the EnvRunner for the next rollout.
- Actual: The weights on the Learner update every iteration, but the weights on the EnvRunner remain essentially unchanged across iterations (see the quick check sketched below).
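
A direct way to see the mismatch is to diff the raw parameters of the two module copies. The sketch below reuses the same private attributes as the repro further down (not stable public API), and the `DEFAULT_POLICY_ID` import path is my assumption; both copies are `torch.nn.Module`s, so their `state_dict()`s can be compared key by key:

```python
from ray.rllib.policy.sample_batch import DEFAULT_POLICY_ID

# Same private attributes as in the repro below -- purely for inspection.
er_module = algo.env_runner_group._local_env_runner.module
learner_module = algo.learner_group._learner._module[DEFAULT_POLICY_ID]

er_state = {k: v.detach().cpu() for k, v in er_module.state_dict().items()}
ln_state = {k: v.detach().cpu() for k, v in learner_module.state_dict().items()}

# After algo.train() these should be (near-)identical if the weights propagated.
for name in sorted(set(er_state) & set(ln_state)):
    max_diff = (er_state[name] - ln_state[name]).abs().max().item()
    print(f"{name}: max abs diff = {max_diff:.6f}")
```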
Example of the error:

Config:

```python
# @title config_test
import functools

from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.core.rl_module.rl_module import RLModuleSpec

# TwoStepStochasticBandits, LogActionsCallback, and ActionMaskingTorchRLModule
# are defined/imported earlier in the notebook (not shown here).

target_env = TwoStepStochasticBandits
single_agent_env = target_env()

# One tuple per door; the env reads these from env_config.
doors = [(0.5, 0.7), (0.4, 0.1), (0.6, 1)]

config = (
    PPOConfig()
    .environment(target_env, env_config={
        'doors': doors,
    })
    .env_runners(
        num_env_runners=0,
        num_envs_per_env_runner=1,
    )
    .callbacks(
        functools.partial(
            LogActionsCallback,
            action_lambda=lambda: range(1, 1 + len(doors)),
        )
    )
    .training(
        lr=1e-4,
        minibatch_size=2048,
    )
    .rl_module(
        rl_module_spec=RLModuleSpec(
            module_class=ActionMaskingTorchRLModule,
            model_config={
                "head_fcnet_hiddens": (32,),
            },
        )
    )
)

algo = config.build_algo()
```
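
Because `TwoStepStochasticBandits` and `LogActionsCallback` are my own classes, here is a minimal, self-contained stand-in environment with the same observation interface (a `Dict` of `"observations"` and `"action_mask"`, four discrete actions) for anyone who wants to run the config above. The class name `TwoStepBanditsStandIn` and all of its dynamics are placeholders, not my actual environment:

```python
import gymnasium as gym
import numpy as np


class TwoStepBanditsStandIn(gym.Env):
    """Minimal stand-in for TwoStepStochasticBandits (placeholder dynamics only).

    Observations are a Dict of a length-4 one-hot state vector ("observations")
    and a length-4 "action_mask", matching the tensors used in the bug demo
    ([1, 0, 0, 0] with mask [0, 1, 1, 1] in the start state).
    """

    def __init__(self, config=None):
        config = config or {}
        # One (p, reward) tuple per door; the semantics here are invented.
        self.doors = config.get("doors", [(0.5, 0.7), (0.4, 0.1), (0.6, 1)])
        self.action_space = gym.spaces.Discrete(4)
        self.observation_space = gym.spaces.Dict({
            "observations": gym.spaces.Box(0.0, 1.0, (4,), np.float32),
            "action_mask": gym.spaces.Box(0.0, 1.0, (4,), np.float32),
        })
        self._state = 0

    def _obs(self):
        obs = np.zeros(4, dtype=np.float32)
        obs[self._state] = 1.0
        if self._state == 0:
            # Start state: only the three "door" actions (1..3) are valid.
            mask = np.array([0.0, 1.0, 1.0, 1.0], dtype=np.float32)
        else:
            mask = np.ones(4, dtype=np.float32)
        return {"observations": obs, "action_mask": mask}

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self._state = 0
        return self._obs(), {}

    def step(self, action):
        if self._state == 0:
            # First step: move to the chosen door's state; no reward yet.
            self._state = max(1, int(action))
            return self._obs(), 0.0, False, False, {}
        # Second step: pay the chosen door's reward with its probability, then end.
        p, r = self.doors[(self._state - 1) % len(self.doors)]
        reward = float(r) if self.np_random.random() < p else 0.0
        return self._obs(), reward, True, False, {}
```

With this stand-in, set `target_env = TwoStepBanditsStandIn` and drop the `.callbacks(...)` line plus the action-distribution printout in the training loop, since those depend on my custom callback.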
Training Loop:

```python
# @title Train
import numpy as np
import torch

from ray.rllib.policy.sample_batch import DEFAULT_POLICY_ID
from ray.rllib.utils.metrics import ENV_RUNNER_RESULTS

num_iters = 10
vf_losses = []
main_returns = []

for i in range(num_iters):
    results = algo.train()

    if ENV_RUNNER_RESULTS in results:
        mean_return = results[ENV_RUNNER_RESULTS].get(
            'episode_return_mean', np.nan
        )
        vf_loss = results['learners'][DEFAULT_POLICY_ID]['vf_loss']
        vf_losses.append(vf_loss)
        main_returns.append(mean_return)

        # What actions are we getting? (player_action_{a} is logged by LogActionsCallback.)
        actions = np.array(
            [results[ENV_RUNNER_RESULTS][f'player_action_{a}'] for a in range(1, 1 + len(doors))]
        )
        actions = actions / actions.sum()
        actions = ', '.join([f'{a:.2f}' for a in actions])

        print(f"iter={i+1} VF loss={vf_loss:.2f} R={mean_return:.2f}\naction distr=[{actions}]")

    # Demo of the bug: query the EnvRunner's and the Learner's copy of the RLModule
    # with the same starting-state observation and compare the action logits.
    # (Reaches into private attributes purely for inspection.)
    test = {
        'obs': {
            'observations': torch.tensor([[1., 0., 0., 0.]]),
            'action_mask': torch.tensor([[0., 1., 1., 1.]]),
        },
    }
    er_outputs = algo.env_runner_group._local_env_runner.module.forward_exploration(
        test
    )['action_dist_inputs'][0][1:]
    learner_outputs = algo.learner_group._learner._module[DEFAULT_POLICY_ID].forward_exploration(
        test
    )['action_dist_inputs'][0][1:]
    print(er_outputs)
    print(learner_outputs)
```
Output:

```
iter=1 VF loss=0.49 R=0.46
action distr=[0.34, 0.32, 0.34]
tensor([0.1735, 0.0714, 0.1752]) # Action logits from starting state on EnvRunner, after first update
tensor([ 0.2722, -0.5590, 0.4195]) # Action logits from starting state on Learner, after first update
iter=2 VF loss=0.48 R=0.46
action distr=[0.33, 0.31, 0.36]
tensor([0.1723, 0.0744, 0.1758])
tensor([ 0.0989, -0.3425, 0.3990])
iter=3 VF loss=0.57 R=0.56
action distr=[0.31, 0.27, 0.42]
tensor([0.1726, 0.0744, 0.1759])
tensor([ 0.1174, -0.3478, 0.3808])
iter=4 VF loss=0.48 R=0.54
action distr=[0.38, 0.31, 0.31]
tensor([0.1725, 0.0746, 0.1760])
tensor([ 0.1015, -0.3280, 0.3788])
iter=5 VF loss=0.53 R=0.52
action distr=[0.42, 0.27, 0.31]
tensor([0.1727, 0.0741, 0.1761])
tensor([ 0.1313, -0.3722, 0.3960])
iter=6 VF loss=0.48 R=0.51
action distr=[0.29, 0.34, 0.37]
tensor([0.1735, 0.0746, 0.1761])
tensor([ 0.1757, -0.3607, 0.3191])
iter=7 VF loss=0.51 R=0.49
action distr=[0.36, 0.28, 0.36]
tensor([0.1725, 0.0752, 0.1769])
tensor([ 0.0546, -0.2765, 0.3851])
iter=8 VF loss=0.49 R=0.55
action distr=[0.38, 0.32, 0.30]
tensor([0.1732, 0.0744, 0.1766])
tensor([ 0.1679, -0.4032, 0.3970])
iter=9 VF loss=0.50 R=0.47
action distr=[0.31, 0.27, 0.43]
tensor([0.1726, 0.0756, 0.1769])
tensor([ 0.0727, -0.2964, 0.3902])
iter=10 VF loss=0.54 R=0.45
action distr=[0.32, 0.37, 0.32]
tensor([0.1729, 0.0752, 0.1770]) # Action logits from starting state on EnvRunner, after last update (virtually unchanged!)
tensor([ 0.1059, -0.3327, 0.3948]) # Action logits from starting state on Learner, after last update (very different!)
```
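
If this is indeed a missing weight sync, manually copying the weights after every `algo.train()` call should work around it; below is a sketch of what I mean. The `get_state()`/`set_state()` round-trip between the two module copies and the `sync_weights(...)` signature mentioned in the trailing comment are assumptions on my part, not verified on 2.51.1.

```python
# Workaround sketch. Assumptions: RLModule.get_state()/set_state() round-trip
# between the Learner's and the EnvRunner's module copies, and the private
# attribute paths below (the same ones used in the repro) still point at the
# live modules.
from ray.rllib.policy.sample_batch import DEFAULT_POLICY_ID

learner_module = algo.learner_group._learner._module[DEFAULT_POLICY_ID]
er_module = algo.env_runner_group._local_env_runner.module

for _ in range(num_iters):
    algo.train()
    # Manually push the freshly updated Learner weights to the local EnvRunner.
    er_module.set_state(learner_module.get_state())

# A public-API alternative might be something like:
#   algo.env_runner_group.sync_weights(from_worker_or_learner_group=algo.learner_group)
# but I have not verified that signature on 2.51.1.
```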