How to calculate the advantage in the forward method of a custom model based on the PPO algorithm

Severity of the issue: High (completely blocks me).

Under the new API stack, the forward method needs to return the advantage; otherwise a KeyError is raised. If I return a placeholder for the advantage, the policy loss is always 0. How can this problem be solved? The code is as follows:
    # Fuse the latent features and coordinate features into one action embedding.
    combined = torch.cat([z, coord_feat], dim=1).unsqueeze(1).unsqueeze(1)
    action_embed = self.fusion(combined).squeeze()

    action_logits = self._pi_head(action_embed)
    # action_logits += torch.log(obs["action_mask"].float().to(device) + 1e-10)

    # Mask out invalid actions by forcing their logits to a large negative value.
    action_logits = torch.where(
        obs["action_mask"].bool(),
        action_logits,
        torch.tensor(-1e10, device=device),
    )
    # Value-function prediction for the current observations.
    values = self._vf_head(action_embed).squeeze(-1)

    action_dist = self.action_dist_cls(logits=action_logits)
    # action = torch.argmax(action_dist.logits, dim=-1)
    action = action_dist.sample()

    output = {
        Columns.ACTIONS: action,
        Columns.ACTION_DIST_INPUTS: action_logits,
        Columns.VF_PREDS: values,
        Columns.ACTION_LOGP: action_dist.logp(action),
        # Columns.ADVANTAGES:
        # Columns.VALUE_TARGETS:
        Columns.EMBEDDINGS: action_embed,
    }
    return output

Hi @ZanhaPeng,

I think you must be misinterpreting where the error is coming from. You cannot compute the advantage inside the forward pass. The advantage has to be computed from a completed trajectory of rewards and state/action pairs, which a single forward call never sees.
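
For reference, here is a minimal sketch of generalized advantage estimation (GAE), the estimator PPO typically uses. The function and argument names are illustrative only (they are not part of RLlib's API); the point is that the computation walks backwards over a whole rollout, which is why it cannot live inside a per-step forward. As far as I know, in the new API stack PPO's learner connector pipeline fills in Columns.ADVANTAGES and Columns.VALUE_TARGETS for you after rollouts are collected.

    import torch

    def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
        # Illustrative GAE over ONE finished trajectory.
        # rewards, values, dones: 1-D tensors of equal length T, where `values`
        # are the critic's V(s_t) predictions gathered during the rollout.
        T = rewards.shape[0]
        advantages = torch.zeros(T)
        last_adv = 0.0
        # Walk the trajectory backwards: each step's advantage is its TD error
        # plus the discounted, lambda-weighted advantage of the next step.
        for t in reversed(range(T)):
            next_value = 0.0 if t == T - 1 else values[t + 1]
            not_done = 1.0 - dones[t].float()
            delta = rewards[t] + gamma * next_value * not_done - values[t]
            last_adv = delta + gamma * lam * not_done * last_adv
            advantages[t] = last_adv
        value_targets = advantages + values
        return advantages, value_targets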

Something else must be misconfigured. In fact, I think you are returning too many outputs from your forward method.

Take a look at the default forward that PPO's RLModule uses. You can see that it only adds Columns.EMBEDDINGS, Columns.STATE_OUT, and Columns.ACTION_DIST_INPUTS.
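
To make that concrete, here is a rough sketch of how your forward could be trimmed down to only those keys. It reuses the names from your snippet (self.fusion, self._pi_head, the action mask); self._encode is just a stand-in for however you currently produce z and coord_feat, and the exact method name (a single _forward vs. the _forward_inference / _forward_exploration / _forward_train trio) depends on your RLlib version, so treat it as a sketch rather than a drop-in replacement. If I remember the new API stack correctly, action sampling and log-probs are handled by the connectors, value predictions come from compute_values() (ValueFunctionAPI), and advantages/value targets are added in the learner, so none of them belong in this dict.

    import torch
    from ray.rllib.core.columns import Columns

    def _forward(self, batch, **kwargs):
        obs = batch[Columns.OBS]
        # Stand-in for your own encoding code that produces `z` and `coord_feat`.
        z, coord_feat = self._encode(obs)

        combined = torch.cat([z, coord_feat], dim=1).unsqueeze(1).unsqueeze(1)
        action_embed = self.fusion(combined).squeeze()

        action_logits = self._pi_head(action_embed)
        action_logits = torch.where(
            obs["action_mask"].bool(),
            action_logits,
            torch.tensor(-1e10, device=action_logits.device),
        )

        # Only the keys the default PPO forward provides; everything else
        # (actions, log-probs, values, advantages, value targets) is computed
        # elsewhere in the pipeline.
        return {
            Columns.EMBEDDINGS: action_embed,
            Columns.ACTION_DIST_INPUTS: action_logits,
            # Columns.STATE_OUT: ...,  # only needed for a stateful (e.g. RNN) module
        }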

Perhaps you can share the error you are getting about the advantages.