Importance sampling in APPO

How severely does this issue affect your experience of using Ray?
Low

I am trying to square what I am seeing in the APPO code with, for instance, equation (4) here for off-policy updates.

In the paper, eq. (4) defines the importance sampling ratio as "target policy / behavior policy". However, the APPO code appears to compute "behavior policy / target policy".
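For concreteness, here is a minimal sketch of the two orderings in log space (the tensors are made-up illustrative values; logp_target and logp_behavior stand in for the target-policy and behavior-policy log-probs of the sampled actions, not the actual RLlib variables):

import torch

# Illustrative log-probs of the sampled actions under each policy.
logp_target = torch.tensor([-1.2, -0.3])    # target (learner) policy
logp_behavior = torch.tensor([-0.9, -0.6])  # behavior (sampling) policy

# Paper-style ratio: target / behavior, clamped like the APPO line.
ratio_paper = torch.clamp(torch.exp(logp_target - logp_behavior), 0.0, 2.0)

# What the quoted APPO line appears to compute: behavior / target.
ratio_code = torch.clamp(torch.exp(logp_behavior - logp_target), 0.0, 2.0)

print(ratio_paper, ratio_code)  # element-wise reciprocals of each other (before clamping)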

In particular, old_policy_actions_logp comes from the detached target model output, while prev_actions_logp comes from the behaviour_logits stored in the train batch.

In other words, it seems like this line should be

is_ratio = torch.clamp(
    torch.exp(old_policy_actions_logp - prev_actions_logp), 0.0, 2.0
)

rather than

is_ratio = torch.clamp(
    torch.exp(prev_actions_logp - old_policy_actions_logp), 0.0, 2.0
)

I am guessing I am wrong about this, but I am not sure why. Any insights would be appreciated. Thanks!