Hi,
I am working on a custom policy that uses the state_batches argument of compute_actions() to keep track of an internal policy state, which gets updated at each timestep of the environment (think of a running expectation value of the observations). I use the Trajectory View API with the following settings in my policy's __init__():
from ray.rllib.policy.view_requirement import ViewRequirement

self.view_requirements['state_in_0'] = ViewRequirement(
    'state_out_0',
    shift=-1,
    used_for_training=False,
    used_for_compute_actions=True,
)
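For context, the semantics I am assuming for shift=-1 (this is my reading, not something I have verified in the source):

# What I expect shift=-1 to mean:
# state_in_0[t] == state_out_0[t - 1]   for t > 0
# state_in_0[0] == get_initial_state()[0]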
The initial state is defined as:
import numpy as np

# Initial state in custom policy:
def get_initial_state(self):
    return [np.zeros(8, dtype=np.float64)]
By the way, this initial state has already been reshaped (and cast to float32) by the time it arrives in compute_actions():
# What happened here? 
[array([[0., 0., 0., 0., 0., 0., 0., 0.]], dtype=float32)]
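As far as I can tell, the sample collector adds a batch dimension of 1 and casts to the framework's default float32. A minimal numpy sketch of the transformation I am observing (my reconstruction, not RLlib's actual code):

# My reconstruction of what seems to happen to the initial state:
init_state = [np.zeros(8, dtype=np.float64)]  # what get_initial_state() returns
batched = [np.asarray([s], dtype=np.float32) for s in init_state]
# -> [array([[0., 0., 0., 0., 0., 0., 0., 0.]], dtype=float32)]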
Furthermore, as the second return value of my policy's compute_actions(), I return a list of BATCH_SIZE numpy arrays, each of shape (STATE_SIZE,).
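Concretely, a stripped-down sketch of what I am currently returning (the action computation and the actual state update rule are omitted; only the shapes matter here):

def compute_actions(self, obs_batch, state_batches=None, **kwargs):
    # Real action computation omitted; placeholder for illustration.
    actions = [self.action_space.sample() for _ in obs_batch]
    # Second return value: a list of BATCH_SIZE arrays,
    # each of shape (STATE_SIZE,) = (8,).
    new_states = [np.zeros(8, dtype=np.float64) for _ in range(len(obs_batch))]
    return actions, new_states, {}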
When I analyze the SampleBatch of an episode after training, I can see that state_out_0 is not identical to the state_in_0 of the next timestep (why? some kind of normalization?):
print(batch['state_in_0'][:2])
print(batch['state_out_0'][:2])
# state_in_0[:2]
[[ 0.          0.          0.          0.          0.          0.
   0.          0.        ]
 [-0.05122958  1.07819295  1.02973902  1.037444    0.89701152 -0.04827012
  -0.0305887  -0.01290726]]  # <- this should equal ...
# state_out_0[:2]
[[0.02288876 0.58843416 0.31873515 0.56406111 0.29287788 0.
  0.         0.        ]    # <- ... this
 [0.0270972  0.74560356 0.45348084 0.71006745 0.40950587 0.0121695
  0.00722509 0.00228068]]
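Given the shift=-1 view requirement, I would expect the following check to pass (written out only to make the expected relationship explicit):

# Each row of state_in_0 (except the first) should equal the
# previous row of state_out_0:
assert np.allclose(batch['state_in_0'][1:], batch['state_out_0'][:-1])

But as the printout above shows, it does not.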
I took a look at the definition of compute_actions(), whose docstring says the new state batches are returned with shape [STATE_SIZE, BATCH_SIZE] and type List[TensorType]. So I thought I had to change the output shape, and did so:
# [STATE_SIZE, BATCH_SIZE] = [8, 1]
[array([0.02288876]), array([0.58843414]), array([0.31873516]), array([0.56406113]), array([0.29287789]), array([0.]), array([0.]), array([0.])]
However, in the next timestep the state_in_0 variable arrives as:
[array([0.02288876])]
which necessarily produces an error. I am confused. Can anyone tell me how to correctly define the initial state and how to return the state_batches? (Maybe with a hint as to where in the source code the state_batches are processed.)
Thanks for your help
Simon