The IS and WIS off-policy estimators skip the last step of the episode when calculating their estimates.
    for t in range(batch.count - 1):
        V_prev += rewards[t] * self.gamma**t
        w_t = self.filter_values[t] / self.filter_counts[t]
        V_step_WIS += p[t] / w_t * rewards[t] * self.gamma**t
Since the upper bound of range is exclusive, range(batch.count - 1) covers steps 0 through batch.count - 2, so the last step of the episode (index batch.count - 1) is never visited. For my use case, the last step of the episode typically contains the only reward value, so that reward never enters the estimate.
I could add an extra terminal step to each episode to work around this, but I don’t know whether the extra step would impact training at all.
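To illustrate the effect outside of RLlib, here is a minimal sketch in plain Python. The helper discounted_sum and the toy rewards list are made up for this example and are not part of the estimator code; the sketch only reproduces the discounted reward sum, not the importance weights. It compares a loop over len(rewards) - 1 steps with a loop over all steps, and shows what happens when a zero-reward step is appended.

```python
def discounted_sum(rewards, gamma, num_steps):
    """Hypothetical helper: sum gamma**t * rewards[t] for t in range(num_steps)."""
    total = 0.0
    for t in range(num_steps):
        total += rewards[t] * gamma**t
    return total


gamma = 0.5
rewards = [0.0, 0.0, 0.0, 1.0]  # sparse reward: only the last step pays out

# Mirrors range(batch.count - 1): visits steps 0..2 and misses the reward.
truncated = discounted_sum(rewards, gamma, len(rewards) - 1)

# Visits every step, 0..3, so the final reward is counted.
full = discounted_sum(rewards, gamma, len(rewards))

# Workaround sketch: append a zero-reward step so the truncated loop
# still reaches the step that carries the real reward.
padded = rewards + [0.0]
padded_truncated = discounted_sum(padded, gamma, len(padded) - 1)

print(truncated, full, padded_truncated)  # 0.0 0.125 0.125
```

In this toy case the padded episode recovers the same discounted sum as the full loop. It says nothing about whether the extra step affects training.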