Last step not counted in off policy estimation

mhall · December 3, 2020, 11:52pm

The IS and WIS off-policy estimators skip the last step of the episode when calculating their estimates.

    for t in range(batch.count - 1):
        V_prev += rewards[t] * self.gamma**t
        w_t = self.filter_values[t] / self.filter_counts[t]
        V_step_WIS += p[t] / w_t * rewards[t] * self.gamma**t

Since range excludes its stop value, the loop runs from step 0 to batch.count - 2 and skips the last step of the episode. For my use case, however, the last step of the episode typically contains the only reward value.
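To make the effect concrete, here is a minimal standalone sketch (not the RLlib code itself) that compares a discounted return computed with the truncated loop bound against one computed over the full episode. The function name `discounted_return`, the reward list, and the value of gamma are hypothetical and chosen only for illustration.

```python
def discounted_return(rewards, gamma, skip_last):
    """Sum gamma**t * rewards[t] over the episode.

    With skip_last=True the loop bound mimics range(batch.count - 1),
    so the final step is never visited.
    """
    n = len(rewards) - 1 if skip_last else len(rewards)
    return sum(rewards[t] * gamma**t for t in range(n))


# Hypothetical sparse-reward episode: only the final step pays out.
rewards = [0.0, 0.0, 0.0, 1.0]
gamma = 0.5

truncated = discounted_return(rewards, gamma, skip_last=True)
full = discounted_return(rewards, gamma, skip_last=False)

print(truncated, full)
```

With a reward only on the final step, the truncated loop sees no reward at all, while the full loop picks up the discounted terminal reward.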

I could add an extra terminal step to account for this, but I don’t know if this will impact training at all.
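As a rough sketch of that workaround, assuming the estimator multiplies each discounted reward by the cumulative importance ratio up to that step: padding the episode with one dummy step that has zero reward and an importance ratio of 1.0 lets a loop over range(n - 1) reach the real final step. The helper `truncated_is_estimate` and all values below are hypothetical, and the sketch says nothing about whether the padded step affects training.

```python
def truncated_is_estimate(rewards, ratios, gamma):
    """Per-step importance sampling estimate using the truncated loop.

    ratios[t] is the (hypothetical) new-policy / behaviour-policy action
    probability ratio at step t; p is their running product.
    The loop bound len(rewards) - 1 reproduces the skipped last step.
    """
    estimate = 0.0
    p = 1.0
    for t in range(len(rewards) - 1):
        p *= ratios[t]
        estimate += p * rewards[t] * gamma**t
    return estimate


# Hypothetical episode whose only reward is on the last step.
rewards = [0.0, 0.0, 1.0]
ratios = [1.0, 2.0, 0.5]
gamma = 0.5

without_padding = truncated_is_estimate(rewards, ratios, gamma)

# Workaround: append a dummy terminal step (zero reward, ratio 1.0).
with_padding = truncated_is_estimate(rewards + [0.0], ratios + [1.0], gamma)

print(without_padding, with_padding)
```

Because the dummy step carries zero reward and a ratio of 1.0, it contributes nothing to the estimate itself; it only shifts the loop bound so the real last step is counted.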

felipeeeantunes · December 4, 2020, 12:14pm

This should not be necessary. I’ll submit a PR solving this.

felipeeeantunes · December 4, 2020, 12:21pm

You can follow here.

sven1977 · December 8, 2020, 11:41am

This was fixed by @felipeeeantunes and just merged. Thanks for raising this issue @mhall and thanks for the fix @felipeeeantunes!
