How to use a self-defined tensor as the padding observation for LSTM/Attention models


If the observation includes an action mask, padding with all-zero observations for the LSTM is not what I want. My code requires the action mask to be non-zero.

However, RLlib currently pads the LSTM input with zeros. A related problem can be found here. I tried to replace this with non-zero padding, but I could not find where the zero padding for the LSTM is applied.

@sven1977 Could you point me to where this function is located? Many thanks.

I am not sure whether checking, for each row in the batched input_dict, if the action mask is all zeros is a good approach. Maybe there is a better solution?

Hi @Shanchao_Yang,

The method that rllib calls to do the padding is here:

The lines that actually pad with 0s are here:

Many thanks! RNN + action masking does not work well with zero padding.

This is worth a longer discussion. I suppose it depends on your action distribution and its ability to handle zeros. As far as I remember, these values are never passed into the environment, so they are never used to produce real actions. They are also usually masked out in the losses, although there is still at least one algorithm that does not currently mask them (MARWIL).


Yes, zero padding is fine if it is not used in the forward pass or when calculating the loss. However, my policy model cannot handle all-zero input. My observations have variable length, so I pad them and store a tensor that records the real shape. Padding with zeros then causes problems when the forward function is called.

The other thing you could do, without having to change the underlying library, is something like this in your forward function:

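def forward(self, input_dict, state, seq_lens):
    # A row is treated as padding when it has no non-zero entries.
    # axis=1 assumes obs_flat has shape [batch, features]; I cannot check this now.
    padded = (input_dict["obs_flat"] != 0).sum(axis=1) == 0
    # handle padded rows here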
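
One way the "handle padded rows" step could look is sketched below in NumPy, so it can be run without RLlib. The helper name `fix_padded_action_masks` and the observation layout (action mask in the first `num_actions` columns) are assumptions made for illustration, not part of RLlib. It flags a row as padding only when every entry is zero, which avoids misclassifying real rows whose entries happen to sum to zero, and then gives padded rows a permissive mask.

```python
import numpy as np


def fix_padded_action_masks(obs_flat, num_actions):
    """Return (action_mask, is_padded) for a batch of flat observations.

    Hypothetical layout: the first `num_actions` columns of `obs_flat`
    hold the action mask, the remaining columns hold features.
    """
    obs_flat = np.asarray(obs_flat, dtype=np.float32)
    # A row counts as padding only if every entry is exactly zero.
    is_padded = ~np.any(obs_flat != 0, axis=1)
    action_mask = obs_flat[:, :num_actions].copy()
    # Padded rows get an all-ones mask so that masking logits with it
    # never removes every action.
    action_mask[is_padded] = 1.0
    return action_mask, is_padded


obs = np.array(
    [
        [1, 0, 1, 0.5, -0.5],   # real row: actions 0 and 2 allowed
        [0, 0, 0, 0.0, 0.0],    # zero-padded row
        [1, 0, 0, -1.0, 0.0],   # real row whose entries sum to zero
    ],
    dtype=np.float32,
)
mask, is_padded = fix_padded_action_masks(obs, num_actions=3)
```

The third row shows why a plain `sum(axis=1) == 0` test can be too loose: its entries cancel out even though it is a real observation.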


What do you think about having pad_batch_to_sequences_of_same_size add a mask to the sample_batch indicating which rows are real vs. padded? There are already many places in the code base that have to reconstruct this mask in some way. It would be better, and would reduce the bug surface, to compute and store it once there.
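
To make the proposal concrete, here is a minimal NumPy sketch of padding that returns the mask alongside the padded data. `pad_with_mask` is a hypothetical standalone helper, not RLlib's pad_batch_to_sequences_of_same_size, and its `pad_value` parameter is an assumption added to show how a non-zero padding value could fit in. It assumes every sequence has shape [T_i, F] with T_i <= max_seq_len.

```python
import numpy as np


def pad_with_mask(sequences, max_seq_len, pad_value=0.0):
    """Pad sequences of shape [T_i, F] to [N, max_seq_len, F].

    Returns the padded array and a boolean mask of shape
    [N, max_seq_len] that is True for real timesteps.
    """
    feature_dim = sequences[0].shape[1]
    padded = np.full(
        (len(sequences), max_seq_len, feature_dim), pad_value, dtype=np.float32
    )
    mask = np.zeros((len(sequences), max_seq_len), dtype=bool)
    for i, seq in enumerate(sequences):
        t = len(seq)
        padded[i, :t] = seq   # copy the real timesteps
        mask[i, :t] = True    # mark them as real
    return padded, mask


seqs = [np.ones((2, 3), dtype=np.float32), np.ones((4, 3), dtype=np.float32)]
padded, mask = pad_with_mask(seqs, max_seq_len=4, pad_value=-1.0)
```

With the mask stored once next to the data, downstream code would not need to infer padding from the values themselves.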


Yes, this is a solution for torch, since there we can handle the non-padded obs. But variable-size observations are not friendly to tensorflow models.

@mannyv, thanks for your suggestions! I agree, maybe we should store the boolean mask itself inside the sample batch. This would eliminate some (duplicate and repeated) code, I guess. Worth a try. On the other hand, the information is all completely there inside “seq_lens”, and it’s really just doing something like tf.boolean_mask(values, tf.sequence_mask(seq_lens, max_seq_len)). But yeah, I’d say we’ll do that. Would you like to do a PR to fix this, @mannyv?
Also great catch on MARWIL! :slight_smile: We do say in the docs that it supports RNNs, but it’s not true (I’ll change that). The only off-policy RNN supporting algo afaik is currently R2D2. We can probably take some logic from it regarding burn-in and stuff.
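
The point that the mask is fully determined by seq_lens can be sketched as follows. This is a NumPy analogue of what tf.sequence_mask computes, plus a masked mean of the kind a loss might use to ignore padded timesteps. The helper names `sequence_mask` and `masked_mean` are assumptions for illustration, not RLlib functions.

```python
import numpy as np


def sequence_mask(seq_lens, max_seq_len):
    """Boolean mask [N, max_seq_len], True where t < seq_lens[n]."""
    seq_lens = np.asarray(seq_lens)
    return np.arange(max_seq_len)[None, :] < seq_lens[:, None]


def masked_mean(per_step_values, mask):
    """Mean over real timesteps only; padded entries are ignored."""
    weights = mask.astype(np.float32)
    return float((per_step_values * weights).sum() / weights.sum())


mask = sequence_mask([2, 3], max_seq_len=3)
# The 100.0 sits in a padded slot and must not affect the result.
loss = np.array([[1.0, 3.0, 100.0], [2.0, 2.0, 2.0]], dtype=np.float32)
mean_loss = masked_mean(loss, mask)
```

An algorithm that averages the loss without such a mask would let the padded slot dominate the result in this example.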