Last step not counted in off policy estimation

The IS and WIS off-policy estimators skip the last step of each episode when calculating their estimates.

    for t in range(batch.count - 1):
        V_prev += rewards[t] * self.gamma**t
        w_t = self.filter_values[t] / self.filter_counts[t]
        V_step_WIS += p[t] / w_t * rewards[t] * self.gamma**t

Because range excludes its upper bound, the loop runs from step 0 to batch.count - 2 and skips the last step of the episode. In my use case, however, the last step of the episode typically contains the only reward value.
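To make the effect concrete, here is a minimal standalone sketch. It does not use the library's estimator classes; `discounted_return` and its `skip_last` flag are hypothetical names introduced only for illustration, and the reward list is made up to mimic an episode whose only reward arrives on the final step.

```python
def discounted_return(rewards, gamma, skip_last):
    """Sum of gamma**t * rewards[t] over the episode.

    skip_last=True mimics the range(batch.count - 1) loop above;
    skip_last=False iterates over every step, including the last one.
    (Hypothetical helper, not part of the library.)
    """
    n = len(rewards)
    steps = range(n - 1) if skip_last else range(n)
    return sum(rewards[t] * gamma**t for t in steps)


# Sparse-reward episode: the only reward is on the final step.
rewards = [0.0, 0.0, 0.0, 1.0]
gamma = 0.5

buggy = discounted_return(rewards, gamma, skip_last=True)
fixed = discounted_return(rewards, gamma, skip_last=False)

print(buggy)  # 0.0   (the terminal reward is never counted)
print(fixed)  # 0.125 (1.0 * 0.5**3)
```

With a sparse terminal reward, the truncated loop yields an estimate of zero regardless of the policy, which is why the off-by-one matters so much in this setting.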

I could add an extra terminal step to account for this, but I don’t know whether that would affect training.

This should not be necessary. I’ll submit a PR solving this.


You can follow here.


This was fixed by @felipeeeantunes and just merged. Thanks for raising this issue @mhall and thanks for the fix @felipeeeantunes!
