LSTM/Attn wrappers take the policy output instead of the latent features

vakker00 · December 15, 2021, 9:42pm

In the current implementation it seems like that if use_lstm (or use_attention) is true then it wraps the provided policy with the LSTM or Attention wrapper in a weird way.

During the forward pass the pseudo code is (see here):

# LSTM wrapper's forward
def forward(input):
    wrapped_out = wrapped.forward(input)
    return forward_rnn(wrapped_out)

Shouldn’t the recurrent module get the latent features instead of the output of the logits_branch? Or am I missing something and if the wrapper is used then the wrapped policy outputs the features? I.e. instead of doing this:

# Wrapped policy's forward
def forward(inputs):
    features = convs(inputs)
    return logits_branch(features)

Based on this paper and the Stable Baselines implementation it seems like that the latent state should be passed to the recurrent module (whichever is being used).

mannyv · December 16, 2021, 9:01pm

Hi @vakker00,

In the standard rllib models there is really no difference between latent feature layers and logot action layers other than where they occur in the architecture and their sizes. Latent layers being in the middle with sizes specified by fcnet_hiddens and logits being at the end with size determined by the action space. When you request a wrapper the model catalog will create and use a final layer in the wrapper.

The wrapper methods were added after the original models were written. When they were added they ket the original naming convention which is where the confusion comes from.

vakker00 · December 17, 2021, 12:40pm

@mannyv thanks for the response, there is no confusion about the “how”. The question is rather about the “why”, and the theoretical consideration of where the input to the recurrent wrapper should be extracted from.

RLlib’s default implementation takes the output of the policy and feeds it to the recurrent layers.
The referenced paper and the SB implementation takes the hidden state of the policy (e.g. the output of a CNN encoder) and not the output (which typically has significantly lower information content).

So the question is: is there a publication that the RLlib implementation of recurrent policies is based on?

mannyv · December 17, 2021, 2:28pm

@vakker00,

My understanding is that SB3 does not have LSTM support, I just did a quick search to see if that has changed and I could not find that it had. If I missed it please do point it out to me.

SB2 has LSTM support but as far as I can tell, looking at the code, the “default” method does it the same way as rllib does. It depends on if the user specifies provides a net_arch parameter or not:

If the user does not then it does it just like rllib does here:

github.com

hill-a/stable-baselines/blob/14630fbac70aaa633f8a331c8efac253d3ed6115/stable_baselines/common/policies.py#L422

    
      
              layers = [64, 64]
          else:
              warnings.warn("The layers parameter is deprecated. Use the net_arch parameter instead.")
          
          
with tf.variable_scope("model", reuse=reuse):
              if feature_extraction == "cnn":
                  extracted_features = cnn_extractor(self.processed_obs, **kwargs)
              else:
                  extracted_features = tf.layers.flatten(self.processed_obs)
                  for i, layer_size in enumerate(layers):
                      extracted_features = act_fun(linear(extracted_features, 'pi_fc' + str(i), n_hidden=layer_size,
                                                          init_scale=np.sqrt(2)))
              input_sequence = batch_to_seq(extracted_features, self.n_env, n_steps)
              masks = batch_to_seq(self.dones_ph, self.n_env, n_steps)
              rnn_output, self.snew = lstm(input_sequence, masks, self.states_ph, 'lstm1', n_hidden=n_lstm,
                                           layer_norm=layer_norm)
              rnn_output = seq_to_batch(rnn_output)
              value_fn = linear(rnn_output, 'vf', 1)
          
          
    self._proba_distribution, self._policy, self.q_value = \
                  self.pdtype.proba_distribution_from_latent(rnn_output, rnn_output)

If the user does then it is up to the discretion of the user:
for example you can do it with net_arch=[64,'lstm',...] or without net_arch=['lstm',...]:

github.com

hill-a/stable-baselines/blob/14630fbac70aaa633f8a331c8efac253d3ed6115/stable_baselines/common/policies.py#L449-L461

    
      
          if isinstance(layer, int):  # Check that this is a shared layer
              layer_size = layer
              latent = act_fun(linear(latent, "shared_fc{}".format(idx), layer_size, init_scale=np.sqrt(2)))
          elif layer == "lstm":
              if lstm_layer_constructed:
                  raise ValueError("The net_arch parameter must only contain one occurrence of 'lstm'!")
              input_sequence = batch_to_seq(latent, self.n_env, n_steps)
              masks = batch_to_seq(self.dones_ph, self.n_env, n_steps)
              rnn_output, self.snew = lstm(input_sequence, masks, self.states_ph, 'lstm1', n_hidden=n_lstm,
                                           layer_norm=layer_norm)
              latent = seq_to_batch(rnn_output)
              lstm_layer_constructed = True
          else:

I am not a heavy SB user so I may have gotten it wrong so please do correct any misunderstandings.

On more general comment. The forum has users with a very wide spectrum of RL experience from very experienced to oh man they asked me to do an RL project yesterday what is it? With this in mind I try to assume very little about their background when someone asks a question so that it might be more broadly useful.

vakker00 · December 17, 2021, 3:34pm

@mannyv thanks for the reply. Yes, it’s only in SB2, the link in the original post points to that (also see the linked paper, that has a clearer illustration).

When you request a wrapper the model catalog will create and use a final layer in the wrapper.

To me it seems like that the section that you mentioned (with or without net_arch) clearly applies the RNN on the latent features BEFORE any of the policy or vf layers. I think this might also be the case for RLlib, but it’s slightly obfuscated.

I think what’s happening is that when the recurrent wrapper is used, the num_outputs that’s used to instantiate the wrapped class is None, so self._logits is also None and self.last_layer_is_flattened is False. This leads to the wrapped class’ forward function to use the return conv_out, state case instead of return logits, state (see here), i.e. it passes the latent vector to the wrapper and not the output of self._logits.

So if that’s correct, then I think it actually clarifies my main question:

Or am I missing something and if the wrapper is used then the wrapped policy outputs the features?

Yes, when the wrapper is used then the wrapped policy passes the extracted features to the wrapper.

Edit: one thing that might be a potential bug, that the post_fcnet_hiddens layers are not applied when the wrapper is used (_logits_branch should be constructed similarly to the wrapped class’ _logits). It was added recently to VisionNetwork (and I guess to others as well), and maybe that change was not implemented in the wrappers.

mannyv · December 17, 2021, 5:40pm

@vakker00,

Looking at the cnn module in SB2. Doesn’t it pass the CNN features to a linear layer as the last operation?

github.com

hill-a/stable-baselines/blob/14630fbac70aaa633f8a331c8efac253d3ed6115/stable_baselines/common/policies.py#L29

    
      
          
          
    :param scaled_images: (TensorFlow Tensor) Image input placeholder
              :param kwargs: (dict) Extra keywords parameters for the convolutional layers of the CNN
              :return: (TensorFlow Tensor) The CNN output layer
              """
              activ = tf.nn.relu
              layer_1 = activ(conv(scaled_images, 'c1', n_filters=32, filter_size=8, stride=4, init_scale=np.sqrt(2), **kwargs))
              layer_2 = activ(conv(layer_1, 'c2', n_filters=64, filter_size=4, stride=2, init_scale=np.sqrt(2), **kwargs))
              layer_3 = activ(conv(layer_2, 'c3', n_filters=64, filter_size=3, stride=1, init_scale=np.sqrt(2), **kwargs))
              layer_3 = conv_to_fc(layer_3)
              return activ(linear(layer_3, 'fc1', n_hidden=512, init_scale=np.sqrt(2)))
          
          

          
def mlp_extractor(flat_observations, net_arch, act_fun):
              """
              Constructs an MLP that receives observations as an input and outputs a latent representation for the policy and
              a value network. The ``net_arch`` parameter allows to specify the amount and size of the hidden layers and how many
              of them are shared between the policy network and the value network. It is assumed to be a list with the following
              structure:
          
          
    1. An arbitrary length (zero allowed) number of integers each specifying the number of units in a shared layer.

Topic		Replies	Views
Wrapping Rllib's Built-In Wrappers RLlib	3	542	April 28, 2021
LSTM Auto Wrapper RLlib	6	1531	October 2, 2021
Question on code: _wrapped_forward RLlib	2	345	November 30, 2021
RLLIB LSTM model summary view RLlib	1	797	March 31, 2023
'use_lstm' wrapping in older and newer Ray versions RLlib	0	626	March 16, 2022

LSTM/Attn wrappers take the policy output instead of the latent features

Related topics