How to reset rnn states at episode end in a torch model?

LukasNothhelfer · May 31, 2021, 6:30pm

Hello,
I have a simple model with a few fully connected layers and an LSTM layer in between. The method interfaces (e.g. forward_rnn) are well explained in the documentation, and I was able to implement them. However, I wonder how the hidden state of the recurrent model is reset at the end of an episode. Does RLlib do this automatically (and if so, where), or do I have to take care of it myself? What are the best practices here?

sven1977 · June 1, 2021, 11:19am

Hey @LukasNothhelfer, RLlib automatically resets the internal state at the beginning of an episode. Note that the internal state is not saved inside the model, but “carried” by the RolloutWorkers and their SampleCollectors. At the beginning of an episode, RLlib uses the initial state defined either by the model via its get_initial_state method, or by the model’s view-requirements dict.
In other words, you should be fine. To confirm, you could print the state tensors passed into your forward passes and check that they are all 0.0 (or whatever initial value you defined) at the beginning of each episode.
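To illustrate what it means for the state to live outside the model, here is a minimal, framework-free sketch in plain Python (no RLlib or torch). `ToyRecurrentPolicy` and `rollout` are hypothetical stand-ins, not RLlib classes. The driver loop plays the role of the rollout worker: it fetches a fresh state from `get_initial_state()` at every episode start, feeds each step's output state back in as the next input state, and logs every incoming state so the check suggested above can be run.

```python
from typing import List, Tuple


class ToyRecurrentPolicy:
    """Hypothetical one-unit recurrent policy (NOT an RLlib class)."""

    def get_initial_state(self) -> List[float]:
        # Similar in spirit to a model's get_initial_state(): all zeros.
        return [0.0]

    def step(self, obs: float, state: List[float]) -> Tuple[int, List[float]]:
        # Toy recurrence: new hidden value depends on old state and obs.
        h = 0.5 * state[0] + obs
        action = 1 if h > 0 else 0
        return action, [h]


def rollout(policy: ToyRecurrentPolicy, episodes: List[List[float]]):
    """Run episodes, carrying the state outside the policy and resetting
    it via get_initial_state() whenever a new episode begins."""
    seen_state_in = []  # (episode_index, t, state_in), for debugging
    for ep_idx, observations in enumerate(episodes):
        state = policy.get_initial_state()  # reset at episode start
        for t, obs in enumerate(observations):
            seen_state_in.append((ep_idx, t, list(state)))
            _action, state = policy.step(obs, state)  # state_out -> next state_in
    return seen_state_in


log = rollout(ToyRecurrentPolicy(), [[1.0, 2.0, 3.0], [4.0, 5.0]])
for ep_idx, t, state_in in log:
    if t == 0:
        print(f"episode {ep_idx} starts with state_in={state_in}")
# prints "episode 0 starts with state_in=[0.0]", then the same for episode 1
```

Within an episode the logged states grow ([0.0], [1.0], [2.5] for the first episode), while every episode start shows the initial value again.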

LukasNothhelfer · June 1, 2021, 12:32pm

Hello @sven1977, thanks for the quick feedback. How can I find out, from the input_dict passed to the forward method or from the inputs passed to the forward_rnn method, at which index in the batch a new episode was started, and thus where zeros are to be expected in the corresponding states? I only get the done flag in input_dict, and in the inputs for forward_rnn this info is not present at all. Additionally, I would be interested to know where in the source code this mechanism for resetting the hidden state at the beginning of an episode is implemented. Can you help me here?

BTW, I think a hint in the documentation would help many users, as it would head off a lot of questions, especially from newcomers like me. Tensorforce, for example, gives such a hint (see the Tensorforce 0.6.3 documentation on Layers, which says: “RNN consequently maintains a temporal internal state over the course of an episode.”).

mannyv · June 1, 2021, 11:17pm

The function that ultimately resets the state for a new episode is here:

https://github.com/ray-project/ray/blob/35ec91c4e04c67adc7123aa8461cf50923a316b4/rllib/evaluation/episode.py#L178

        """Returns the previous reward for the specified agent."""
        history = self._agent_reward_history[agent_id]
        if len(history) >= 2:
            return history[-2]
        else:
            # We're at t=0, so there is no previous reward, just return zero.
            return 0.0

    @DeveloperAPI
    def rnn_state_for(self, agent_id: AgentID = _DUMMY_AGENT_ID) -> List[Any]:
        """Returns the last RNN state for the specified agent."""
        if agent_id not in self._agent_to_rnn_state:
            policy = self._policies[self.policy_for(agent_id)]
            self._agent_to_rnn_state[agent_id] = policy.get_initial_state()
        return self._agent_to_rnn_state[agent_id]

    @DeveloperAPI
    def last_pi_info_for(self, agent_id: AgentID = _DUMMY_AGENT_ID) -> dict:
        """Returns the last info object for the specified agent."""

Essentially, when an episode is finished, a new one is created at the code linked below. The new episode has no state entry for any agent yet, since no agent has stepped the environment in it. So the first time rnn_state_for is called for an agent, it calls get_initial_state on the policy mapped to that agent.

https://github.com/ray-project/ray/blob/ebc44c3d76d114e6192d697e3715aa73bc66924d/rllib/evaluation/sampler.py#L863

        if "infos" in pol.view_requirements:
            values_dict["infos"] = agent_infos
        sample_collector.add_action_reward_next_obs(
            episode.episode_id, agent_id, env_id, policy_id,
            agent_done, values_dict)

        if not agent_done:
            item = PolicyEvalData(
                env_id, agent_id, filtered_obs, agent_infos, None
                if last_observation is None else
                episode.rnn_state_for(agent_id), None
                if last_observation is None else
                episode.last_action_for(agent_id),
                rewards[env_id][agent_id] or 0.0)
            to_eval[policy_id].append(item)

    # Invoke the `on_episode_step` callback after the step is logged
    # to the episode.
    # Exception: The very first env.poll() call causes the env to get reset
    # (no step taken yet, just a single starting observation logged).
    # We need to skip this callback in this case.
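The lazy initialization in rnn_state_for can be imitated outside RLlib to see why a fresh episode always starts from the initial state. This is a heavily simplified sketch based only on the logic quoted from episode.py: `MiniEpisode`, `CountingPolicy`, `set_rnn_state`, and `default_mapping` are hypothetical names, and the real Episode class does much more. `CountingPolicy` also counts its `get_initial_state()` calls, a cheap alternative to breakpoints for checking whether it runs once per episode and agent.

```python
from typing import Any, Callable, Dict, List


class CountingPolicy:
    """Hypothetical policy stand-in that counts get_initial_state() calls."""

    def __init__(self) -> None:
        self.initial_state_calls = 0

    def get_initial_state(self) -> List[Any]:
        self.initial_state_calls += 1
        return [0.0, 0.0]  # a new list each call, so episodes never share state


class MiniEpisode:
    """Simplified imitation of the lazy lookup in Episode.rnn_state_for."""

    def __init__(self, policies: Dict[str, CountingPolicy],
                 policy_mapping: Callable[[str], str]) -> None:
        self._policies = policies
        self._policy_mapping = policy_mapping
        # A brand-new episode starts with no stored state for any agent.
        self._agent_to_rnn_state: Dict[str, List[Any]] = {}

    def policy_for(self, agent_id: str) -> str:
        return self._policy_mapping(agent_id)

    def rnn_state_for(self, agent_id: str) -> List[Any]:
        # Same pattern as the quoted code: fall back to the initial state
        # the first time an agent's state is requested in this episode.
        if agent_id not in self._agent_to_rnn_state:
            policy = self._policies[self.policy_for(agent_id)]
            self._agent_to_rnn_state[agent_id] = policy.get_initial_state()
        return self._agent_to_rnn_state[agent_id]

    def set_rnn_state(self, agent_id: str, state: List[Any]) -> None:
        # Hypothetical setter standing in for where state_out gets stored.
        self._agent_to_rnn_state[agent_id] = state


def default_mapping(agent_id: str) -> str:
    return "default"


policy = CountingPolicy()
ep1 = MiniEpisode({"default": policy}, default_mapping)
print(ep1.rnn_state_for("agent_0"))   # [0.0, 0.0], fetched lazily
ep1.set_rnn_state("agent_0", [0.3, -0.1])
print(ep1.rnn_state_for("agent_0"))   # [0.3, -0.1], carried within the episode

ep2 = MiniEpisode({"default": policy}, default_mapping)  # next episode
print(ep2.rnn_state_for("agent_0"))   # [0.0, 0.0] again
print(policy.initial_state_calls)     # 2
```

Because the reset comes from creating a new episode object rather than from clearing anything inside the model, a breakpoint in get_initial_state should be hit once per new episode and agent, provided this code path is actually reached.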

As for detecting, during training, where in a batch a new episode begins, I do not know a good way to get that info.
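If you do have a time-ordered stream of per-step done flags from a single environment (for example, logged during a rollout, as opposed to a padded or shuffled train batch), the episode starts can at least be recovered from those flags: a step begins a new episode exactly when the previous step had done=True. A minimal sketch with a hypothetical helper name; the flags cannot tell whether the first step of the stream is itself an episode start, so that is left to the caller.

```python
from typing import List, Sequence


def episode_start_mask(dones: Sequence[bool],
                       first_is_start: bool = True) -> List[bool]:
    """Mark which steps begin a new episode in a time-ordered stream of
    steps from ONE environment (hypothetical helper, not an RLlib API).

    A step begins an episode if the previous step ended one (done=True).
    Whether step 0 is an episode start is decided by `first_is_start`.
    """
    mask: List[bool] = []
    for i in range(len(dones)):
        if i == 0:
            mask.append(first_is_start)
        else:
            mask.append(bool(dones[i - 1]))
    return mask


dones = [False, False, True, False, True, False]
print(episode_start_mask(dones))  # [True, False, False, True, False, True]
```

Steps marked True are the ones where you would expect the incoming RNN state to equal the initial state.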

LukasNothhelfer · June 1, 2021, 11:32pm

Hello @mannyv, thanks for your reply. I recently stumbled across that code snippet as well. During my debug sessions I was wondering why the get_initial_state method is only called when the trainer is created and never afterwards. I would expect it to be called more often, namely whenever a new episode starts. I set breakpoints with the Ray debugger, but during the run this method was simply never called. I will look into this and report back here if I can confirm it. Thanks for your help.

mannyv · June 1, 2021, 11:56pm

Yeah, there was a bug that has been fixed in the nightly builds but has not made it into a release yet.

Details are here:

https://github.com/ray-project/ray/issues/15483

LukasNothhelfer · June 2, 2021, 2:46am

@mannyv Thanks, that helps a lot
