How severely does this issue affect your experience of using Ray?
I am trying to reconcile what I am seeing in APPO with, for instance, equation (4) in the paper for off-policy updates.
In the paper's eq. (4), the importance sampling ratio is "target policy / behavior policy". However, the APPO code seems to be computing "behavior policy / target policy".
old_policy_actions_logp comes from the detached target model output, while prev_actions_logp comes from the behaviour_logits (which come from the train batch).
In other words, it seems like this line should be

is_ratio = torch.clamp(torch.exp(old_policy_actions_logp - prev_actions_logp), 0.0, 2.0)

rather than the current

is_ratio = torch.clamp(torch.exp(prev_actions_logp - old_policy_actions_logp), 0.0, 2.0)
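To sanity-check the direction of the log-space subtraction, here is a minimal standalone sketch; the probability values are made up purely for illustration, and the variable names just mirror the ones in the APPO code:

```python
import math

# Hypothetical per-action probabilities, only to check which subtraction
# order yields which ratio; names mirror the APPO variables.
old_policy_actions_logp = math.log(0.25)  # target policy pi(a|s)
prev_actions_logp = math.log(0.5)         # behavior policy pi_behavior(a|s)

# "target / behavior" as in the paper's eq. (4):
# exp(logp_target - logp_behavior) = 0.25 / 0.5 = 0.5
target_over_behavior = math.exp(old_policy_actions_logp - prev_actions_logp)

# "behavior / target", which is what the current line computes:
# exp(logp_behavior - logp_target) = 0.5 / 0.25 = 2.0
behavior_over_target = math.exp(prev_actions_logp - old_policy_actions_logp)
```

So with these toy numbers the two orderings give reciprocal ratios (0.5 vs. 2.0), which is why the subtraction direction matters.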
I am guessing I am wrong about this, but I am not sure why. Any insights would be appreciated. Thanks!