Why does off policy estimator use log likelyhood for importance ratio?

mhall · December 3, 2020, 7:22pm

When calculating is and wis off policy estimations, importance ratio is calculated using log probably of new policy over the old probability. This is causing me to get important ratios of -1.25, for example, which over the episode causes the estimations to flop between increasingly large negative and positive values.

From reading papers about important ratios and off policy estimation (for example the one linked in the code https://arxiv.org/pdf/1511.03722.pdf), it looks like it should just be a ratio of probabilities? That makes a lot more sense to me personally.

Is this a bug? Or am I misunderstanding something? Snippet of relevant code below

@override(OffPolicyEstimator)
def estimate(self, batch: SampleBatchType) -> OffPolicyEstimate:
    <snip>
    **new_prob = self.action_prob(batch)**
    # calculate importance ratios
    p = []
    for t in range(batch.count - 1):
        if t == 0:
            pt_prev = 1.0
        else:
            pt_prev = p[t - 1]
        p.append(**pt_prev * new_prob[t] / old_prob[t]**)

def action_prob(self, batch: SampleBatchType) -> np.ndarray:
    """Returns the probs for the batch actions for the current policy."""
    <snip>
    **log_likelihoods**: TensorType = **self.policy.compute_log_likelihoods**(
        actions=batch[SampleBatch.ACTIONS],
        obs_batch=batch[SampleBatch.CUR_OBS],
        state_batches=[batch[k] for k in state_keys],
        prev_action_batch=batch.data.get(SampleBatch.PREV_ACTIONS),
        prev_reward_batch=batch.data.get(SampleBatch.PREV_REWARDS))
    return convert_to_numpy(**log_likelihoods**)

felipeeeantunes · December 3, 2020, 10:15pm

Hi @mhall, I found this bug some weeks ago and Sven submitted a PR to solve it:

Topic		Replies	Views
Importance sampling in appo RLlib	0	216	June 28, 2022
Logging logp_ratio in PPO using custom callbacks RLlib	0	293	October 21, 2021
Policy.compute_log_likelihoods should allows to compute with/without applying the exploration (e.g. SoftQ exploration) RLlib	1	271	April 16, 2021
Confused by output of `compute_log_likelihoods` RLlib	0	324	March 28, 2022
Doubly Robust off-policy estimation method RLlib	6	461	August 3, 2022

Why does off policy estimator use log likelyhood for importance ratio?

Related topics