How severely does this issue affect your experience of using Ray?
I am trying to reconcile what I am seeing in APPO with, for instance, equation (4) in the paper for off-policy updates.
In the paper's eq. (4), the importance sampling ratio is "target policy / behavior policy". However, the APPO code seems to be computing "behavior policy / target policy".
old_policy_actions_logp comes from the detached target model output, while prev_actions_logp comes from the behaviour_logits (which come from the train batch).
In other words, it seems like this line should be

is_ratio = torch.clamp(torch.exp(old_policy_actions_logp - prev_actions_logp), 0.0, 2.0)

rather than the current

is_ratio = torch.clamp(torch.exp(prev_actions_logp - old_policy_actions_logp), 0.0, 2.0)
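To sanity-check the direction of the log-space subtraction, here is a minimal standalone sketch; the probability values are made up purely for illustration, and the variable names just mirror the ones in the APPO code:

```python
import math

# Hypothetical per-action probabilities, only to check which subtraction
# order yields which ratio; names mirror the APPO variables.
old_policy_actions_logp = math.log(0.25)  # target policy pi(a|s)
prev_actions_logp = math.log(0.5)         # behavior policy pi_behavior(a|s)

# "target / behavior" as in the paper's eq. (4):
# exp(logp_target - logp_behavior) = 0.25 / 0.5 = 0.5
target_over_behavior = math.exp(old_policy_actions_logp - prev_actions_logp)

# "behavior / target", which is what the current line computes:
# exp(logp_behavior - logp_target) = 0.5 / 0.25 = 2.0
behavior_over_target = math.exp(prev_actions_logp - old_policy_actions_logp)
```

So with these toy numbers the two orderings give reciprocal ratios (0.5 vs. 2.0), which is why the subtraction direction matters.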
I am guessing I am wrong about this, but I am not sure why. Any insights would be appreciated. Thanks!