Columns.LOSS_MASK is used to denote environment steps that are virtual, or that otherwise shouldn’t be used for weight updates. But the only apparent purpose of including masked steps at all is GAE computation. Why doesn’t the pipeline simply discard these masked steps once GAE has finished, instead of having the learner run computations on them and then throw away the resulting gradients when the mask is applied?
Put another way, why not run the following pseudocode in a connector that is applied after GeneralAdvantageEstimation finishes computing value targets and advantages?
```python
for agent in batch:
    agent_batch = batch[agent]
    # Drop the loss mask from the agent's batch.
    loss_mask = agent_batch.pop(Columns.LOSS_MASK)
    for key in agent_batch:
        # Discard masked steps before passing data to the learner.
        agent_batch[key] = agent_batch[key][loss_mask]
```
We could then strip out the possibly_masked_mean logic in the learner.
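To make the proposal concrete, here is a runnable toy version of that connector. It uses numpy arrays and a plain `"loss_mask"` string as stand-ins for RLlib's actual batch types and `Columns.LOSS_MASK`; the batch layout is assumed, not taken from RLlib.

```python
import numpy as np

LOSS_MASK = "loss_mask"  # stand-in for Columns.LOSS_MASK

def drop_masked_steps(batch):
    """Hypothetical post-GAE connector: physically remove masked steps.

    Assumes each agent's batch is a dict of equal-length numpy arrays and
    that LOSS_MASK is a boolean array (True = keep for the loss).
    """
    for agent_batch in batch.values():
        loss_mask = agent_batch.pop(LOSS_MASK)
        for key in agent_batch:
            agent_batch[key] = agent_batch[key][loss_mask]
    return batch

# Toy batch: the last step is the virtual bootstrap step (masked out).
batch = {
    "agent_0": {
        "obs": np.arange(5),
        "advantages": np.array([0.1, 0.2, 0.3, 0.4, 0.0]),
        LOSS_MASK: np.array([True, True, True, True, False]),
    }
}
drop_masked_steps(batch)
print(batch["agent_0"]["obs"])  # -> [0 1 2 3]
```

After this runs, every array in the agent's batch has the same (shortened) length and the mask key is gone, so the learner would never see the virtual steps.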
RLlib keeps masked steps (e.g., those added by AddOneTsToEpisodesAndTruncate for bootstrapping) in the batch so that the GAE computation can use them for value estimation, but then applies Columns.LOSS_MASK during loss calculation to ensure only valid steps contribute to gradients. This design allows the batch to maintain correct sequence alignment and padding (especially for RNNs/LSTMs), and ensures that all tensors remain the same shape, which is important for efficient batching and parallelization.

Stripping out masked steps before the learner would break this alignment and complicate handling of stateful models and multi-agent cases, which is why the mask is applied during loss computation instead of physically removing the steps from the batch. See ppo_torch_learner.py and add_one_ts_to_episodes_and_truncate.py for details.
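The masked-mean pattern the answer refers to can be sketched as follows. This is a numpy stand-in for illustration, not RLlib's actual implementation: the per-step losses keep their full, aligned shape, and the mask is only consulted at reduction time.

```python
import numpy as np

def possibly_masked_mean(values, mask=None):
    """Sketch of the masked-mean pattern: reduce only over unmasked
    entries, while every tensor in the batch keeps its full shape."""
    if mask is None:
        return values.mean()
    return values[mask].mean()

# Five time steps; step 4 is the virtual bootstrap step whose loss
# value is meaningless and must not influence the gradient.
loss_per_step = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
mask = np.array([True, True, True, True, False])
print(possibly_masked_mean(loss_per_step, mask))  # -> 2.5
```

Because the reduction happens inside the loss, all upstream tensors (observations, logits, values) stay shape-aligned with each other and with any RNN padding.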
If you want more detail on the technical reasons or alternatives, let me know.
I’ve interacted heavily with both of those files, and I don’t see how they depend on masked steps not being removed. An LSTM or a transformer certainly wouldn’t want virtual steps mixed into its state or attention computations, since that would break consistency with the original action distributions (I don’t think this comes up in any of the production code, though; virtual steps only appear after all the legitimate ones in an episode).
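The role of that trailing virtual step can be seen in a minimal GAE sketch (assumed here for illustration; not RLlib's implementation): the bootstrap value of the appended step is consumed during advantage computation and is not needed afterwards, which is the premise of the proposal above.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """GAE over T steps, where `values` has T + 1 entries: the last
    entry is the value of the appended (virtual) bootstrap step. That
    step is consumed here and need not survive into the loss."""
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # TD error uses the *next* step's value -- for t = T - 1 that
        # is the bootstrap value from the virtual step.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

rewards = np.array([1.0, 1.0, 1.0])
values = np.array([0.5, 0.5, 0.5, 0.5])  # last entry: virtual bootstrap step
print(gae_advantages(rewards, values))
```

Since the virtual step sits strictly after all legitimate steps, dropping it post-GAE would not perturb any earlier recurrent state or attention context.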