Does a policy ever get reset?

Hi folks,

I have a custom policy with a variable named period, which I increase by 1 in each call to compute_actions(). After taking a look at my data, I found out that this variable keeps counting after an episode ends.
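
Roughly, the relevant part of my policy looks like this (a simplified sketch; MyPolicy is just a stand-in name and the action computation is a placeholder):

```python
from ray.rllib.policy.policy import Policy


class MyPolicy(Policy):
    def __init__(self, observation_space, action_space, config):
        super().__init__(observation_space, action_space, config)
        self.period = 0  # plain attribute on the policy object

    def compute_actions(self, obs_batch, state_batches=None, **kwargs):
        self.period += 1  # keeps growing across episode boundaries
        actions = [self.action_space.sample() for _ in obs_batch]  # placeholder
        return actions, [], {}
```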

My workaround for now would be to carry this variable in the state of the policy, as the state gets reset at the beginning of a new episode.

Am I correct in assuming that a policy itself does not get reset during the run of a trial (many episodes)? Is this intended, and if so, why? @sven1977 @mannyv

Thanks for your time
Simon

Hi @Lars_Simon_Zehnder,

Can you elaborate on what you are asking here? The state of the policy is independent of the episode. There are some properties in some of the policies that depend on the total number of steps that have occurred, for example in exploration objects.

You could implement this yourself in on_episode_{start/end}. That callback has access to both the environment and the policy. But beware: if your num_envs_per_worker is > 1, then there be dragons here.

https://docs.ray.io/en/master/rllib-training.html#ray.rllib.agents.callbacks.DefaultCallbacks.on_episode_start
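
A rough, untested sketch of what I mean (ResetCounterCallbacks is a made-up name, and it assumes your counter is an attribute called period on the policy):

```python
from ray.rllib.agents.callbacks import DefaultCallbacks


class ResetCounterCallbacks(DefaultCallbacks):
    def on_episode_start(self, *, worker, base_env, policies, episode,
                         env_index, **kwargs):
        # Reset the counter on every policy when a new episode starts.
        # Caveat: with num_envs_per_worker > 1, all sub-envs share the same
        # policy object, so this also clobbers the counter of episodes that
        # are still in flight in the other sub-envs.
        for policy in policies.values():
            if hasattr(policy, "period"):
                policy.period = 0
```

You would register it via config["callbacks"] = ResetCounterCallbacks.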

@mannyv thank you for your valuable inputs and your time.

Can you elaborate on what you are asking here?

What I do: I work on a custom policy that uses a counter to calculate updates of the policy state (in this case, a moving average that gets returned in state_out_0).

What I observed was that even though the state of the policy gets reset at the beginning of a new episode, this counter variable (a class attribute of the policy) does not. In the second episode, this counter starts where it ended in the previous one.

What did I do so far: My first approach was to carry the counter over via the policy’s state.
Then I realized that this counter is actually identical to the timestep in my environment. So I implemented a timestep variable in the environment and pass it over to the custom policy in the obs batch (I also looked at the self.global_timestep attribute, but it stayed 0 during iterations, and the timestep in compute_actions() was None). Now that you bring up the case of multiple environments, I am starting to think again :smiley:

What do I need: A counter that gets reset every time a new episode starts. So, maybe the policy’s state is not that bad a solution.
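
For reference, this is roughly what I mean by the state solution (a simplified sketch of my policy; the moving-average part is only an illustration):

```python
import numpy as np
from ray.rllib.policy.policy import Policy


class MyPolicy(Policy):
    def get_initial_state(self):
        # [counter, running_mean]; gets reset at the start of every episode.
        return [np.zeros(2, dtype=np.float32)]

    def compute_actions(self, obs_batch, state_batches=None, **kwargs):
        state = state_batches[0]                       # shape [batch, 2]
        counter = state[:, 0] + 1.0
        obs = np.asarray(obs_batch, dtype=np.float32)
        # Incremental moving average over the first obs feature (illustration).
        mean = state[:, 1] + (obs[:, 0] - state[:, 1]) / counter
        state_out_0 = np.stack([counter, mean], axis=1)
        actions = [self.action_space.sample() for _ in obs_batch]  # placeholder
        return actions, [state_out_0], {}
```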

Furthermore, I also came across a somewhat related issue:

  • It appears that the agent starts the interaction with the environment, and the environment then steps with the agent’s action (at least when I use the trainer_template with my custom policy).

  • In my case the agent calculates something from the current observation and puts it into the new state. This new state (state_out_0) and the current obs logically belong together. However, as the trainer loop starts with an agent action, the resulting SampleBatch instead pairs the state_out_0 values with the ‘next’ obs values, which in my case do not logically belong together.

I guess that it is not possible to turn this around without writing my own Sampler?

Hi @Lars_Simon_Zehnder,

You are tracking this in your environment, I am sure: a class member that is set to 0/1 when reset is called.

You could carry it in the state, but you could also put it in the observation with a Dict or Tuple observation space. Maybe that does not work for you?
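
Something like this on the environment side (a toy sketch with made-up spaces, just to show the Dict idea):

```python
import gym
import numpy as np
from gym.spaces import Box, Dict, Discrete


class TimestepObsEnv(gym.Env):
    """Toy env exposing the episode timestep as part of a Dict observation."""

    def __init__(self, config=None):
        self.observation_space = Dict({
            "obs": Box(-1.0, 1.0, shape=(4,), dtype=np.float32),
            "t": Box(0.0, np.inf, shape=(1,), dtype=np.float32),
        })
        self.action_space = Discrete(2)
        self.t = 0

    def reset(self):
        self.t = 0
        return {"obs": np.zeros(4, dtype=np.float32),
                "t": np.array([0.0], dtype=np.float32)}

    def step(self, action):
        self.t += 1
        obs = {"obs": np.random.uniform(-1.0, 1.0, 4).astype(np.float32),
               "t": np.array([float(self.t)], dtype=np.float32)}
        done = self.t >= 100  # arbitrary horizon for the toy example
        return obs, 0.0, done, {}
```

Then the counter arrives in the obs batch on the policy side and is reset together with the env.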

From what I remember the last time I looked at that code, when reset is called the obs is placed in the episode object and the other fields have dummy values, but once you call step, the rest of the values are filled in. By the second call to step on that episode, all of the values are associated with the correct timesteps, so the state_in and state_out of the first step will be correct.

I do not have time to double-check this this weekend. But if you want to check:

This is where the obs is placed in a new episode.

This is where the sample for the first (and all other) timestep is made consistent in the sample.