Columns.LOSS_MASK is used to denote environment steps that are virtual, or that otherwise shouldn’t be used for weight updates. But the only apparent purpose of including masked steps at all is GAE computation. Why doesn’t the pipeline simply discard these masked steps once GAE has finished, instead of having the learner run computations on them and then throw away the resulting gradients when the mask is applied?
Put another way, why not run the following pseudocode in a connector that is applied after GeneralAdvantageEstimation finishes computing value targets and advantages?
```python
for agent in batch:
    agent_batch = batch[agent]
    # Drop the loss mask from the agent's batch.
    loss_mask = agent_batch.pop(Columns.LOSS_MASK)
    for key in agent_batch:
        # Discard masked steps before passing data to the learner.
        agent_batch[key] = agent_batch[key][loss_mask]
```
We could then strip out the possibly_masked_mean logic in the learner.
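To make the proposal concrete, here is a runnable toy version of that connector. It uses numpy arrays and a plain `"loss_mask"` string as stand-ins for RLlib's actual batch types and `Columns.LOSS_MASK`; the batch layout is assumed, not taken from RLlib.

```python
import numpy as np

LOSS_MASK = "loss_mask"  # stand-in for Columns.LOSS_MASK

def drop_masked_steps(batch):
    """Hypothetical post-GAE connector: physically remove masked steps.

    Assumes each agent's batch is a dict of equal-length numpy arrays and
    that LOSS_MASK is a boolean array (True = keep for the loss).
    """
    for agent_batch in batch.values():
        loss_mask = agent_batch.pop(LOSS_MASK)
        for key in agent_batch:
            agent_batch[key] = agent_batch[key][loss_mask]
    return batch

# Toy batch: the last step is the virtual bootstrap step (masked out).
batch = {
    "agent_0": {
        "obs": np.arange(5),
        "advantages": np.array([0.1, 0.2, 0.3, 0.4, 0.0]),
        LOSS_MASK: np.array([True, True, True, True, False]),
    }
}
drop_masked_steps(batch)
print(batch["agent_0"]["obs"])  # -> [0 1 2 3]
```

After this runs, every array in the agent's batch has the same (shortened) length and the mask key is gone, so the learner would never see the virtual steps.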
RLlib keeps masked steps (e.g., those added by AddOneTsToEpisodesAndTruncate for bootstrapping) in the batch so that the GAE computation can use them for value estimation, but then applies Columns.LOSS_MASK during loss calculation to ensure only valid steps contribute to gradients. This design allows the batch to maintain correct sequence alignment and padding (especially for RNNs/LSTMs), and ensures that all tensors remain the same shape, which is important for efficient batching and parallelization.

Stripping out masked steps before the learner would break this alignment and complicate handling of stateful models and multi-agent cases, which is why the mask is applied during loss computation instead of physically removing the steps from the batch. See ppo_torch_learner.py and add_one_ts_to_episodes_and_truncate.py for details.
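The masked-mean pattern the answer refers to can be sketched as follows. This is a numpy stand-in for illustration, not RLlib's actual implementation: the per-step losses keep their full, aligned shape, and the mask is only consulted at reduction time.

```python
import numpy as np

def possibly_masked_mean(values, mask=None):
    """Sketch of the masked-mean pattern: reduce only over unmasked
    entries, while every tensor in the batch keeps its full shape."""
    if mask is None:
        return values.mean()
    return values[mask].mean()

# Five time steps; step 4 is the virtual bootstrap step whose loss
# value is meaningless and must not influence the gradient.
loss_per_step = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
mask = np.array([True, True, True, True, False])
print(possibly_masked_mean(loss_per_step, mask))  # -> 2.5
```

Because the reduction happens inside the loss, all upstream tensors (observations, logits, values) stay shape-aligned with each other and with any RNN padding.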
If you want more detail on the technical reasons or alternatives, let me know.
I’ve interacted heavily with both of those files, and I don’t see how they depend on masked steps not being removed. An LSTM or a transformer certainly wouldn’t want virtual steps mixed into its state or attention computations, since that would break consistency with the original action distributions (I don’t think this comes up in any of the production code, though; virtual steps only appear after all the legitimate ones in an episode).
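The role of that trailing virtual step can be seen in a minimal GAE sketch (assumed here for illustration; not RLlib's implementation): the bootstrap value of the appended step is consumed during advantage computation and is not needed afterwards, which is the premise of the proposal above.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """GAE over T steps, where `values` has T + 1 entries: the last
    entry is the value of the appended (virtual) bootstrap step. That
    step is consumed here and need not survive into the loss."""
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # TD error uses the *next* step's value -- for t = T - 1 that
        # is the bootstrap value from the virtual step.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

rewards = np.array([1.0, 1.0, 1.0])
values = np.array([0.5, 0.5, 0.5, 0.5])  # last entry: virtual bootstrap step
print(gae_advantages(rewards, values))
```

Since the virtual step sits strictly after all legitimate steps, dropping it post-GAE would not perturb any earlier recurrent state or attention context.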