@mannyv thank you for your valuable input and your time.
Can you elaborate on what you are asking here?
What I do: I work on a custom policy that uses a counter to calculate updates of the policy state (in this case a moving average that gets returned in the `state_out`).
What I observed was that even though the state of the policy gets reset at the beginning of a new episode, this counter variable (a class attribute of the policy) does not: in the second episode the counter starts where it ended in the previous one.
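To make the setup concrete, here is a minimal sketch of the pattern I mean (class and attribute names are illustrative, not my exact code). Because the counter lives on the policy instance itself, nothing ever resets it between episodes:

```python
import numpy as np
from ray.rllib.policy.policy import Policy


class CounterPolicy(Policy):
    """Illustrative sketch: the counter is an instance attribute."""

    def __init__(self, observation_space, action_space, config):
        super().__init__(observation_space, action_space, config)
        self.counter = 0       # lives on the policy object -> survives episode resets
        self.moving_avg = 0.0

    def compute_actions(self, obs_batch, state_batches=None, **kwargs):
        self.counter += 1      # keeps incrementing in episode 2, 3, ...
        obs = np.asarray(obs_batch, dtype=np.float32)
        # Incremental moving average over everything seen so far.
        self.moving_avg += (float(obs.mean()) - self.moving_avg) / self.counter
        actions = [self.action_space.sample() for _ in obs_batch]
        return actions, [], {}

    def learn_on_batch(self, samples):
        return {}

    def get_weights(self):
        return {}

    def set_weights(self, weights):
        pass
```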
What did I do so far: My first approach was to carry the counter over via the policy's state. Then I realized that this counter is actually identical to the timestep in my environment. So I have now implemented a timestep variable in the environment and pass it over to the custom policy in the `obs` batch (I also looked for the `self.global_timestep` attribute, but that remained 0 during iterations, and another attribute I checked stayed None). Now that you have brought up the case of multiple environments, I am starting to rethink this.
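For the timestep-in-the-obs approach, this is roughly what I did, as a sketch (it assumes a flat Box observation space and the old gym 4-tuple step API; the wrapper name is made up):

```python
import gym
import numpy as np
from gym import spaces


class TimestepInObs(gym.Wrapper):
    """Appends the in-episode timestep to each observation."""

    def __init__(self, env):
        super().__init__(env)
        self.t = 0
        low = np.append(env.observation_space.low, 0.0).astype(np.float32)
        high = np.append(env.observation_space.high, np.inf).astype(np.float32)
        self.observation_space = spaces.Box(low=low, high=high, dtype=np.float32)

    def reset(self, **kwargs):
        self.t = 0  # the counter resets together with the episode
        obs = self.env.reset(**kwargs)
        return np.append(obs, self.t).astype(np.float32)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.t += 1
        return np.append(obs, self.t).astype(np.float32), reward, done, info
```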
What do I need: A counter that gets reset every time a new episode starts. So, maybe the policy's state is not that bad a solution.
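If I go back to the policy's state, the point is that RLlib calls `get_initial_state()` at every episode start (and per sub-environment when vectorized), so a counter kept there does get reset. A rough sketch, again with illustrative names and assuming a flat Box observation:

```python
import numpy as np
from ray.rllib.policy.policy import Policy


class StateCounterPolicy(Policy):
    """Keeps [counter, moving_average] in the recurrent state."""

    def get_initial_state(self):
        # One entry per state tensor; RLlib hands these back to
        # compute_actions as state_batches with a batch dimension.
        return [np.zeros(1, dtype=np.float32),   # counter
                np.zeros(1, dtype=np.float32)]   # moving average

    def compute_actions(self, obs_batch, state_batches=None, **kwargs):
        counter, avg = state_batches             # each of shape [batch, 1]
        counter = counter + 1.0
        obs = np.asarray(obs_batch, dtype=np.float32)
        # Incremental moving average of the per-step mean observation.
        avg = avg + (obs.mean(axis=1, keepdims=True) - avg) / counter
        actions = [self.action_space.sample() for _ in obs_batch]
        return actions, [counter, avg], {}

    def learn_on_batch(self, samples):
        return {}

    def get_weights(self):
        return {}

    def set_weights(self, weights):
        pass
```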
Furthermore, I also came across a somewhat related issue: it appears that the agent starts the interaction with the environment, and the environment then steps with the agent's action (at least when I use the `trainer_template` with my custom policy).
In my case the agent calculates something from the current observation and puts it into the new state, so this new state (`state_out_0`) and the current `obs` logically belong together. However, as the trainer loop starts with an agent action, the resulting `SampleBatch` instead pairs the `state_out_0` values with the 'next' `obs` values, which in my case do not belong together.
I guess that it is not possible to turn this around without writing my own trainer loop.
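That said, instead of rewriting the loop, maybe it is enough to re-pair the columns after sampling in a `postprocess_fn` (the hook that `trainer_template`/policy-builder policies accept). This is only a sketch of the idea: it assumes the batch holds one complete episode, that the offset is exactly one step, and the direction of the shift would have to be verified against the actual batches:

```python
import numpy as np


def realign_state_out(policy, sample_batch, other_agent_batches=None, episode=None):
    """Shift state_out_0 by one row so each row pairs the state with
    the obs it was computed from, not the following obs."""
    state_out = sample_batch["state_out_0"]
    # Duplicate the first row to keep the batch length unchanged;
    # flip the slicing if the offset runs the other way in your data.
    sample_batch["state_out_0"] = np.concatenate(
        [state_out[:1], state_out[:-1]], axis=0)
    return sample_batch
```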