Hi,

I am working on a custom policy that uses the `state_batches` in `compute_actions()` to keep track of an internal policy state, which gets updated at each timestep of the environment (think of a running expectation value of the observations). I use the Trajectory View API with the following settings in my policy's `__init__()`:
```python
self.view_requirements['state_in_0'] = \
    ViewRequirement('state_out_0',
                    shift=-1,
                    used_for_training=False,
                    used_for_compute_actions=True)
```
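For reference, my understanding of `shift=-1` is that `state_in_0` at step `t` should be filled with `state_out_0` from step `t-1`, with step 0 getting the value from `get_initial_state()`. A small sketch of that alignment in plain NumPy (the array contents are made up, this is just my mental model, not RLlib code):

```python
import numpy as np

# Hypothetical per-episode buffer: one state_out row per timestep.
state_out = np.arange(12, dtype=np.float32).reshape(4, 3)  # 4 steps, state size 3
initial_state = np.zeros(3, dtype=np.float32)

# shift=-1: state_in at step t is state_out at step t-1;
# step 0 gets the initial state from get_initial_state().
state_in = np.vstack([initial_state, state_out[:-1]])

assert np.allclose(state_in[1:], state_out[:-1])
print(state_in[0])  # -> [0. 0. 0.]
```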
The initial state is defined as:

```python
# Initial state in custom policy:
def get_initial_state(self):
    return [np.zeros(8, dtype=np.float64)]
```
By the way, by the time it arrives in `compute_actions()`, this initial state has already been reshaped (and cast) to:

```python
# What happened here?
[array([[0., 0., 0., 0., 0., 0., 0., 0.]], dtype=float32)]
```
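If I reproduce this manually, the transformation looks like a batch dimension of size 1 being prepended plus a cast to `float32`. This is my guess at what the preprocessing does, not a confirmed RLlib code path:

```python
import numpy as np

initial = np.zeros(8, dtype=np.float64)        # what get_initial_state() returns
batched = initial[None, :].astype(np.float32)  # add batch dim, cast to float32

print(batched.shape, batched.dtype)  # -> (1, 8) float32
```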
Furthermore, as the second return value of my policy's `compute_actions()`, I return a list of `BATCH_SIZE` numpy arrays of shape `(STATE_SIZE,)`.
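Concretely, my `compute_actions()` return value looks roughly like this (a sketch with dummy actions; `BATCH_SIZE=1` and `STATE_SIZE=8` as in my setup, and `compute_actions_sketch` is just an illustration, not the real signature):

```python
import numpy as np

BATCH_SIZE, STATE_SIZE = 1, 8

def compute_actions_sketch(obs_batch):
    # Dummy actions, one per batch element.
    actions = np.zeros(BATCH_SIZE, dtype=np.int64)
    # Second return value: list of BATCH_SIZE arrays of shape (STATE_SIZE,).
    state_batches = [np.zeros(STATE_SIZE) for _ in range(BATCH_SIZE)]
    return actions, state_batches, {}

actions, state_batches, info = compute_actions_sketch(np.zeros((BATCH_SIZE, 4)))
print(len(state_batches), state_batches[0].shape)  # -> 1 (8,)
```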
When I analyze the `SampleBatch` of an episode after training, I can see that `state_out_0` is not identical to `state_in_0` of the next timestep (why? normalization?):

```python
print(batch['state_in_0'][:2])
print(batch['state_out_0'][:2])
```
```
[[ 0.          0.          0.          0.          0.          0.
   0.          0.        ]
 [-0.05122958  1.07819295  1.02973902  1.037444    0.89701152 -0.04827012
  -0.0305887  -0.01290726]]  # <- this should equal ...
[[0.02288876 0.58843416 0.31873515 0.56406111 0.29287788 0.
  0.         0.        ]    # <- ... this
 [0.0270972  0.74560356 0.45348084 0.71006745 0.40950587 0.0121695
  0.00722509 0.00228068]]
```
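To make the mismatch explicit, here is the check I am effectively doing, with the printed values above hard-coded as stand-ins for the real `SampleBatch` columns:

```python
import numpy as np

# Made-up stand-ins for batch['state_in_0'][:2] and batch['state_out_0'][:2]:
state_in = np.array([[0.0] * 8,
                     [-0.05122958, 1.07819295, 1.02973902, 1.037444,
                      0.89701152, -0.04827012, -0.0305887, -0.01290726]])
state_out = np.array([[0.02288876, 0.58843416, 0.31873515, 0.56406111,
                       0.29287788, 0.0, 0.0, 0.0],
                      [0.0270972, 0.74560356, 0.45348084, 0.71006745,
                       0.40950587, 0.0121695, 0.00722509, 0.00228068]])

# What I expected from shift=-1: state_in[t] == state_out[t-1]
print(np.allclose(state_in[1], state_out[0]))  # -> False
```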
I took a look at the definition of `compute_actions()`, which returns the new `state_batches` with shape `[STATE_SIZE, BATCH_SIZE]` and type `List[TensorType]`. So I thought I had to change the output shape, and did so:

```python
# [STATE_SIZE, BATCH_SIZE] = [8, 1]
[array([0.02288876]), array([0.58843414]), array([0.31873516]), array([0.56406113]),
 array([0.29287789]), array([0.]), array([0.]), array([0.])]
```
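Side by side, these are the two interpretations of the state list I have tried (a sketch with the values from above; which one the API actually expects is exactly my question):

```python
import numpy as np

STATE_SIZE, BATCH_SIZE = 8, 1
state = np.array([0.02288876, 0.58843416, 0.31873515, 0.56406111,
                  0.29287788, 0.0, 0.0, 0.0])

# Interpretation A: one array of shape (STATE_SIZE,) per batch element.
variant_a = [state.copy() for _ in range(BATCH_SIZE)]  # len 1, each (8,)

# Interpretation B: one length-1 array per state component (what I tried above).
variant_b = [np.array([x]) for x in state]             # len 8, each (1,)

print(len(variant_a), variant_a[0].shape)  # -> 1 (8,)
print(len(variant_b), variant_b[0].shape)  # -> 8 (1,)
```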
However, in the next timestep the `state_in_0` variable has the shape:

```python
[array([0.02288876])]
```

which necessarily raises an error. I am confused. Can anyone tell me how to correctly define the initial state and how to return the `state_batches`? (Maybe with a hint where in the source code the processing of the `state_batches` happens.)

Thanks for your help,
Simon