Does a policy ever get reset?

Hi folks,

I have a custom policy with a variable named period, which I increase by 1 in each call to compute_actions(). After taking a look at my data, I found that this variable keeps counting after an episode ends.

My workaround for now would be to carry this variable in the state of the policy, since the state gets reset at the beginning of a new episode.

Am I correct in assuming that a policy itself does not get reset during the run of a trial (many episodes)? Is this intended, and if so, why? @sven1977 @mannyv

Thanks for your time

Hi @Lars_Simon_Zehnder,

Can you elaborate on what you are asking here? The state of the policy is independent of the episode. Some policies have properties that depend on the total number of steps that have occurred, for example the exploration objects.

You could implement this yourself in on_episode_{start/end}. That callback has access to both the environment and the policy. But beware, if your num_envs_per_worker is > 1 then there be dragons here.
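A minimal sketch of the callback idea, written in plain Python rather than against the real RLlib callback API. `EpisodeCounters`, `on_episode_start`, and `on_step` are hypothetical stand-ins, and the actual RLlib callback has a different signature. The sketch assumes that each sub-environment can be identified by an index. Keeping one counter per index is one way around the num_envs_per_worker > 1 problem, where a single shared counter would be reset whenever any of the parallel episodes starts.

```python
from collections import defaultdict


class EpisodeCounters:
    """Hypothetical helper: one step counter per sub-environment index."""

    def __init__(self):
        self.period = defaultdict(int)

    def on_episode_start(self, env_index):
        # Reset only the counter of the env whose episode just started.
        self.period[env_index] = 0

    def on_step(self, env_index):
        # Advance the counter of the env that just stepped.
        self.period[env_index] += 1
        return self.period[env_index]


counters = EpisodeCounters()
counters.on_episode_start(env_index=0)
counters.on_episode_start(env_index=1)
for _ in range(3):
    counters.on_step(0)
counters.on_step(1)
counters.on_episode_start(env_index=0)  # env 0 starts a new episode
```

After the last line, the counter for env 0 is back at zero while the counter for env 1 is untouched.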

@mannyv thank you for your valuable input and your time.

Can you elaborate on what you are asking here?

What I do: I work on a custom policy that is using a counter to calculate updates of the policy state (in this case a moving average that gets returned in the state_out_0).

What I observed was that even though the state of the policy gets reset at the beginning of a new episode, this counter variable (a class attribute of the policy) does not. In the second episode this counter starts where it ended in the last one.

What I have done so far: My first approach was to carry the counter over via the policy’s state.
Then I realized that this counter is actually identical to the timestep in my environment. So I have now implemented a timestep variable in the environment and pass it to the custom policy in the obs batch (I also looked at the self.global_timestep attribute, but it remained 0 across iterations, and the timestep in compute_action was None). Now that you have brought up the case of multiple environments, I am starting to think again :smiley:

What do I need: A counter that gets reset every time a new episode starts. So, maybe the policy’s state is not that bad a solution.
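A minimal sketch of the state-based approach, assuming a toy policy and rollout loop rather than real RLlib classes. `MovingAveragePolicy` and `run_episode` are hypothetical, and a simple running mean stands in for the moving average. The counter lives in the state next to the average, so it restarts whenever the caller asks for a fresh initial state at the start of an episode.

```python
class MovingAveragePolicy:
    """Toy policy (hypothetical, not an RLlib Policy subclass)."""

    def get_initial_state(self):
        # [counter, running mean]; requested by the caller at each episode start.
        return [0, 0.0]

    def compute_actions(self, obs, state):
        count, mean = state
        count += 1
        mean += (obs - mean) / count  # incremental mean update
        action = 0  # placeholder action
        return action, [count, mean]


def run_episode(policy, observations):
    state = policy.get_initial_state()  # the reset happens here
    for obs in observations:
        _, state = policy.compute_actions(obs, state)
    return state


policy = MovingAveragePolicy()
first = run_episode(policy, [1.0, 2.0, 3.0])
second = run_episode(policy, [10.0, 20.0])
```

Because nothing is stored on the policy object itself, the second episode starts counting from zero again.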

Furthermore, I also ran into a somewhat related issue:

  • It appears that the agent starts the interaction with the environment and then with the action of the agent the environment steps (at least when I use the trainer_template with my custom policy).

  • In my case the agent calculates something on the current observation and puts it into the new state. This new state (state_out_0) and the current obs logically belong together. However, as the trainer loop starts with an agent action, the resulting SampleBatch instead holds the state_out_0 values alongside the ‘next’ obs values, which in my case do not logically belong together.

I guess that it is not possible to turn this around without writing my own Sampler?

Hi @Lars_Simon_Zehnder,

I am sure you are tracking this in your environment: a class member that is set to 0/1 when reset is called.

You could carry it in the state but you could also put it in the observation with a dict or Tuple observation. Maybe that does not work for you?
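A minimal sketch of the observation-based variant, assuming a toy environment with a gym-like reset/step interface. `CountingEnv` and the keys "timestep" and "value" are hypothetical names. The environment owns the counter, resets it in reset, and exposes it as one entry of a dict observation, so the policy never has to count on its own.

```python
import random


class CountingEnv:
    """Toy environment (hypothetical, gym-like reset/step interface)."""

    def __init__(self, episode_length=3):
        self.episode_length = episode_length
        self.t = 0

    def _obs(self):
        # Dict observation: the timestep travels with the actual observation.
        return {"timestep": self.t, "value": random.random()}

    def reset(self):
        self.t = 0  # the counter restarts with every new episode
        return self._obs()

    def step(self, action):
        self.t += 1
        done = self.t >= self.episode_length
        return self._obs(), 0.0, done, {}


env = CountingEnv()
seen = []
for _ in range(2):  # two episodes
    obs, done = env.reset(), False
    seen.append(obs["timestep"])
    while not done:
        obs, _, done, _ = env.step(action=0)
        seen.append(obs["timestep"])
```

With several environments per worker, each environment instance carries its own counter, so nothing is shared between parallel episodes.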

From what I remember from the last time I looked at that code, when reset is called the obs is placed in the episode object and the other fields have dummy values, but once you call step the rest of the values are filled in. By the second call to step on that episode, all of the values are associated with the correct timesteps, so the state_in and _out of the first step will be correct.
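The bookkeeping described above can be illustrated with a toy rollout loop. This is not RLlib's sampler code, and `toy_policy` and `collect` are hypothetical. Whether the real SampleBatch lines up this way should be checked against the source. The sketch only shows how each row can pair the observation at a timestep with the state and action computed from that same observation.

```python
def toy_policy(obs, state_in):
    """Hypothetical policy: the new state is the running sum of observations."""
    state_out = state_in + obs
    action = 0  # placeholder action
    return action, state_out


def collect(observations):
    """Record rows so each one pairs obs_t with values computed from obs_t."""
    rows = []
    state = 0  # initial state at episode start
    obs = observations[0]  # observation returned by reset
    for next_obs in observations[1:]:
        action, state_out = toy_policy(obs, state)
        rows.append({
            "obs": obs,
            "state_in": state,
            "state_out": state_out,
            "action": action,
            "new_obs": next_obs,
        })
        # The next row starts from the new observation and the new state.
        obs, state = next_obs, state_out
    return rows


rows = collect([1, 2, 3, 4])
```

In each row, state_out is derived from obs in the same row, and the following observation is kept separately under new_obs.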

I do not have time to double check this this weekend. But if you want to check,

This is where the obs is placed in a new episode.

This is where the sample for the first (and all other) timestep is made consistent in the sample.