When calculating is and wis off policy estimations, importance ratio is calculated using log probably of new policy over the old probability. This is causing me to get important ratios of -1.25, for example, which over the episode causes the estimations to flop between increasingly large negative and positive values.
From reading papers about important ratios and off policy estimation (for example the one linked in the code https://arxiv.org/pdf/1511.03722.pdf), it looks like it should just be a ratio of probabilities? That makes a lot more sense to me personally.
Is this a bug? Or am I misunderstanding something? Snippet of relevant code below
@override(OffPolicyEstimator) def estimate(self, batch: SampleBatchType) -> OffPolicyEstimate: <snip> **new_prob = self.action_prob(batch)** # calculate importance ratios p =  for t in range(batch.count - 1): if t == 0: pt_prev = 1.0 else: pt_prev = p[t - 1] p.append(**pt_prev * new_prob[t] / old_prob[t]**) def action_prob(self, batch: SampleBatchType) -> np.ndarray: """Returns the probs for the batch actions for the current policy.""" <snip> **log_likelihoods**: TensorType = **self.policy.compute_log_likelihoods**( actions=batch[SampleBatch.ACTIONS], obs_batch=batch[SampleBatch.CUR_OBS], state_batches=[batch[k] for k in state_keys], prev_action_batch=batch.data.get(SampleBatch.PREV_ACTIONS), prev_reward_batch=batch.data.get(SampleBatch.PREV_REWARDS)) return convert_to_numpy(**log_likelihoods**)