How severely does this issue affect your experience of using Ray?
I am trying to reconcile what I am seeing in APPO with, for instance, equation (4) in the paper for off-policy updates.
In the paper's eq. (4), the importance sampling ratio is "target policy / behavior policy". However, the APPO code seems to be computing "behavior policy / target policy".
old_policy_actions_logp comes from the detached target model output, while prev_actions_logp comes from the behaviour_logits (which come from the train batch).
In other words, it seems like this line should be

is_ratio = torch.clamp(torch.exp(old_policy_actions_logp - prev_actions_logp), 0.0, 2.0)

rather than the current

is_ratio = torch.clamp(torch.exp(prev_actions_logp - old_policy_actions_logp), 0.0, 2.0)
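To sanity-check the direction of the log-space subtraction, here is a minimal standalone sketch; the probability values are made up purely for illustration, and the variable names just mirror the ones in the APPO code:

```python
import math

# Hypothetical per-action probabilities, only to check which subtraction
# order yields which ratio; names mirror the APPO variables.
old_policy_actions_logp = math.log(0.25)  # target policy pi(a|s)
prev_actions_logp = math.log(0.5)         # behavior policy pi_behavior(a|s)

# "target / behavior" as in the paper's eq. (4):
# exp(logp_target - logp_behavior) = 0.25 / 0.5 = 0.5
target_over_behavior = math.exp(old_policy_actions_logp - prev_actions_logp)

# "behavior / target", which is what the current line computes:
# exp(logp_behavior - logp_target) = 0.5 / 0.25 = 2.0
behavior_over_target = math.exp(prev_actions_logp - old_policy_actions_logp)
```

So with these toy numbers the two orderings give reciprocal ratios (0.5 vs. 2.0), which is why the subtraction direction matters.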
I am guessing I am wrong about this, but I am not sure why. Any insights would be appreciated. Thanks!