Why does a SampleBatch contain a different number of elements for the hidden states of the RNN than for the obs, actions, advantages...?

I have written a simple training procedure using the standard API calls. For example, I get the following output on the command line (shortened version):

2021-06-02 17:10:13,390 INFO rollout_worker.py:741 -- Completed sample batch:

      'actions': np.ndarray((167,), dtype=int64, min=0.0, max=6.0, mean=2.976),
      'advantages': np.ndarray((167,), dtype=float32, min=-2.164, max=2.518, mean=-0.077),
      'new_obs': np.ndarray((167, 11), dtype=float32, min=-0.906, max=1.0, mean=0.496),
      'obs': np.ndarray((167, 11), dtype=float32, min=-0.906, max=1.0, mean=0.496),
      'rewards': np.ndarray((167,), dtype=float32, min=-2.2, max=2.48, mean=-0.111),
      'state_in_0': np.ndarray((17, 128), dtype=float32, min=-0.298, max=0.214, mean=-0.002),
      'state_in_1': np.ndarray((17, 128), dtype=float32, min=-0.532, max=0.398, mean=-0.003),
      'state_out_0': np.ndarray((167, 128), dtype=float32, min=-0.301, max=0.215, mean=-0.002),
      'state_out_1': np.ndarray((167, 128), dtype=float32, min=-0.536, max=0.4, mean=-0.003),

I would like to understand why there are only 17 elements for state_in_[0|1], while there are 167 elements for all other variables.

The state_in values are only added for environment steps t that satisfy t % max_seq_len == 0, i.e. once at the start of each chunk of at most max_seq_len steps.
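To make that condition concrete, here is a minimal sketch in plain Python. It is not RLlib code: the helper name `kept_state_in_steps` and the episode length and `max_seq_len` values are made up for illustration, and it assumes the step counter t restarts at 0 with every new episode.

```python
def kept_state_in_steps(episode_len, max_seq_len):
    """Return the timesteps t of one episode for which a state_in row is kept.

    Hypothetical helper: it only mirrors the condition t % max_seq_len == 0.
    """
    return [t for t in range(episode_len) if t % max_seq_len == 0]


# Illustrative values only, not taken from the log above.
steps = kept_state_in_steps(episode_len=45, max_seq_len=20)
print(steps)       # [0, 20, 40]
print(len(steps))  # 3 state_in rows for 45 timesteps
```

Under this assumption, a batch that spans several episodes gets one extra state_in row at the start of each episode, so the state_in count depends on the episode lengths and not only on the total number of timesteps.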


I thought the hidden state is passed over the course of an entire episode, not just max_seq_len time steps.

Hey @LukasNothhelfer, great question. This happens when the batch has already been “reduced” so that it only holds the initial internal state of each sequence, where the sequences start at the timesteps implied by “seq_lens”. For example:

obs=0 1 2 3 4 5 6 7 8
seq_lens=4, 5
state_in_0=1 2 (<- 1 is the init state for obs 0, 2 is the init state for obs 4)

This saves some space in the batch, because the loss functions normally do not need the intermediate states (the model, e.g. an LSTM, re-computes them anyway when it is run over each sequence).
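The reduction in the example above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the code RLlib runs: `init_states_for_sequences` is a hypothetical helper, and the string placeholders stand in for the real state arrays.

```python
from itertools import accumulate


def init_states_for_sequences(per_step_states, seq_lens):
    """Keep only the state that each sequence starts with.

    Hypothetical helper: the start index of every sequence is the sum of
    the lengths of all sequences before it.
    """
    starts = [0] + list(accumulate(seq_lens))[:-1]
    return [per_step_states[s] for s in starts]


obs = list(range(9))   # timesteps 0..8, as in the example above
seq_lens = [4, 5]      # two sequences: obs 0-3 and obs 4-8
assert sum(seq_lens) == len(obs)

# Placeholder per-step states; real ones would be arrays of shape (128,).
per_step_state_in = [f"state_before_obs_{t}" for t in obs]

init_states = init_states_for_sequences(per_step_state_in, seq_lens)
print(init_states)  # ['state_before_obs_0', 'state_before_obs_4']
```

Only two of the nine per-step states survive, one per entry in seq_lens, which matches the pattern of state_in_* having far fewer rows than obs.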
