KeyError: 'advantages' in PPO MARL

A KeyError: 'advantages' in Ray RLlib PPO with a custom RLModule usually means the 'advantages' field is missing from the training batch, and PPO's loss cannot be computed without it. This is almost always because the value function predictions (Columns.VF_PREDS) are not returned by your RLModule's forward methods, so RLlib cannot compute advantages during postprocessing. In your MaskedRLModule, _forward only returns Columns.ACTION_DIST_INPUTS, not Columns.VF_PREDS, which GAE (and therefore PPO) requires (see this discussion, Ray RLlib docs).
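To see why the value predictions matter, here is a minimal, simplified sketch of generalized advantage estimation (GAE), the step RLlib runs in postprocessing. This is not RLlib's actual code; gamma, lam, and the bootstrap value are illustrative defaults, and episode-termination handling is omitted. Without per-timestep value predictions there is simply no way to produce the 'advantages' column:

import numpy as np

def gae_advantages(rewards, vf_preds, gamma=0.99, lam=0.95, last_value=0.0):
    # Append a bootstrap value V(s_T) so every step has a "next value".
    values = np.append(vf_preds, last_value)
    advantages = np.zeros(len(rewards), dtype=np.float32)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of TD errors (the GAE recursion).
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages  # PPO also derives value targets as advantages + vf_preds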

To fix this, update your _forward (and _forward_train, if you override it) to also return Columns.VF_PREDS, e.g.:

# Add at the top of your module file, if not already imported:
from typing import Any, Dict
from ray.rllib.core.columns import Columns

def _forward(self, batch: Dict[str, Any], **kwargs) -> Dict[str, Any]:
    obs = batch["obs"]["obs"].float()
    # Policy head: logits that parameterize the action distribution.
    logits = self.policy_net(obs)
    # Value head: per-state value estimates, needed to compute GAE advantages.
    values = self.value_net(obs).squeeze(-1)
    if "action_mask" in batch["obs"]:
        # Suppress invalid actions by pushing their logits toward -inf.
        mask = batch["obs"]["action_mask"]
        logits = logits.masked_fill(mask == 0, -1e9)
    return {
        Columns.ACTION_DIST_INPUTS: logits,
        Columns.VF_PREDS: values,
    }

This will allow RLlib to compute advantages and avoid the KeyError (see example fix, Ray RLlib postprocessing).
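As a quick, illustrative sanity check (assuming module is an instance of your MaskedRLModule; the observation and action dimensions below are made up, so replace them with your environment's), you can confirm the forward output now carries both keys PPO's postprocessing needs:

import torch
from ray.rllib.core.columns import Columns

# Hypothetical shapes: obs_dim=8, num_actions=4.
fake_batch = {
    "obs": {
        "obs": torch.randn(2, 8),
        "action_mask": torch.ones(2, 4),
    }
}
out = module._forward(fake_batch)  # module: your MaskedRLModule instance
assert Columns.ACTION_DIST_INPUTS in out and Columns.VF_PREDS in out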

Would you like a step-by-step explanation of why this is required and how RLlib computes advantages?
