Understanding how loss is computed from training

I am trying to implement a custom loss, in particular the imitation loss, with tf1 as the framework.
For the first iteration (the first batch, which contains 1 episode) I get a successful weight update, but for the second, apart from specific cases which I describe below, it ends with a runtime error:
the reshape inside add_time_dimension (inside forward, before the training samples are passed to the RNN) asks for a dimension that is not compatible with mine.
As replay experience I am using the output generated directly by a standard training run, so the format of the batch should be compatible.
The problem seems to be this: each episode is divided into sequences of a fixed max length (20), and the last one is simply the remainder, for example:
episode of length 86 → 20, 20, 20, 20, 6
The reshape takes as input the max length and new_batch_size,
which is basically the first dimension of the flattened input // max_length, so equal to 4 here (86 // 20). When I try to reshape I end up with this error, except in the cases where the division is exact.
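The mismatch described above can be reproduced outside the policy with plain NumPy. This is only a sketch of the arithmetic, not the actual add_time_dimension code; the names flattened_input, max_seq_len and try_reshape, and the feature size of 3, are assumptions made for illustration.

```python
import numpy as np

max_seq_len = 20   # fixed maximum sequence length from the post
episode_len = 86   # example episode: chunks of 20, 20, 20, 20, 6
feature_dim = 3    # hypothetical feature size, for illustration only

# Flattened, unpadded batch of shape [episode_len, feature_dim].
flattened_input = np.zeros((episode_len, feature_dim), dtype=np.float32)

# Batch size as described above: first dimension integer-divided by max length.
new_batch_size = flattened_input.shape[0] // max_seq_len


def try_reshape(x, batch_size, seq_len):
    """Return the reshaped array, or None if the element counts do not match."""
    try:
        return x.reshape(batch_size, seq_len, x.shape[-1])
    except ValueError:
        return None


# 86 rows cannot be arranged as 4 sequences of 20 steps, so this fails.
reshaped = try_reshape(flattened_input, new_batch_size, max_seq_len)
reshape_failed = reshaped is None
```

The reshape only succeeds when the number of rows is an exact multiple of the max length, which matches the "division is exact" cases mentioned above.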
Very easy: I have to pad the last sequence to the right length. That's what I did from the beginning, and that is why it works for any first batch: the first batch is taken, padded, converted to a tensor, and everything works correctly.
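For reference, the padding idea can be sketched like this in NumPy. It is a minimal illustration under the assumption that the short sequence is the last one in the flattened batch; pad_to_multiple is a hypothetical helper, not a function from the library or from my code.

```python
import numpy as np


def pad_to_multiple(flat, max_seq_len):
    """Zero-pad the time axis so its length is a multiple of max_seq_len."""
    remainder = flat.shape[0] % max_seq_len
    if remainder == 0:
        return flat
    pad_rows = max_seq_len - remainder
    # Pad only at the end of the first (time) axis; leave other axes untouched.
    pad_width = [(0, pad_rows)] + [(0, 0)] * (flat.ndim - 1)
    return np.pad(flat, pad_width, mode="constant")


flat = np.ones((86, 3), dtype=np.float32)   # 86 steps, hypothetical 3 features
padded = pad_to_multiple(flat, 20)          # 86 rows padded up to 100
chunks = padded.reshape(-1, 20, padded.shape[-1])
```

With several episodes in one batch, each short sequence would need to be padded in place rather than only at the end, and the true sequence lengths would have to be kept somewhere so the padded steps can be masked out of the loss.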
But for the second batch I don't understand why it no longer seems to follow the same procedure. None of my intermediate prints show up, and I suspect I am missing something, because inside the policy there is a function _initialize_loss_from_dummy_batch, and inside the batch reader there is a comment that says “Reading initial batch of data from input reader.”
I am padding inside tf_input_ops, which is called in forward, so it should always work, but that is not the case.
How can I solve the problem, and how does the handling of the first batch differ from that of the second? Why is the second batch not padded?

I could easily solve it in eager mode with tf2 by making sure the padding runs before the reshape, but the policy is using tf1 and eager mode is False, so I also wonder whether moving to tf2 as the framework for the policy would be time-consuming at this point. What, in general, needs to be adapted to make it work?

Hi @michele-pel,

Can you share a reproduction script? I am going to need that to offer suggestions effectively.

In the meantime, here are some posts that contain information about how training with an LSTM works. You may find some useful information in them.