Hi,
I am trying to log the logp_ratio that is computed in /agents/ppo/ppo_tf_policy.py:
logp_ratio = tf.exp(
    curr_action_dist.logp(train_batch[SampleBatch.ACTIONS]) -
    train_batch[SampleBatch.ACTION_LOGP])
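For context, my understanding of that line is that it is the standard PPO importance ratio r = pi_new(a|s) / pi_old(a|s), computed in log space as r = exp(logp_new - logp_old), where logp_old is the action_logp recorded in the sample batch when the actions were collected.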
I am trying to do this in the following manner:
import numpy as np

from ray.rllib.agents.callbacks import DefaultCallbacks


class MyCallbacks(DefaultCallbacks):
    def on_learn_on_batch(self, *, policy, train_batch, result: dict, **kwargs):
        # Try to remember the previous log-probs in the result dict.
        if "prev_logp" not in result:
            result["logp_ratio"] = 1
        else:
            result["logp_ratio"] = np.exp(result["prev_logp"] - train_batch["action_logp"])
        result["prev_logp"] = train_batch["action_logp"]
But obviously, every time I am in on_learn_on_batch, the result dict is empty again, so "prev_logp" never survives to the next call.
How can I store such values between calls?
My high-level idea is to debug more deeply what is going on in the PPO surrogate function by logging it to the wandb (W&B) platform.
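One workaround I have been considering is to keep the previous log-probs on the callbacks object itself instead of in result. This is only a minimal, untested sketch, and it assumes that the same MyCallbacks instance is reused across on_learn_on_batch calls on a worker and that train_batch["action_logp"] is a numpy array at this point:

import numpy as np

from ray.rllib.agents.callbacks import DefaultCallbacks


class MyCallbacks(DefaultCallbacks):
    def __init__(self):
        super().__init__()
        # State kept on the callbacks instance, since `result` starts
        # empty on every on_learn_on_batch call.
        self.prev_logp = None

    def on_learn_on_batch(self, *, policy, train_batch, result: dict, **kwargs):
        action_logp = train_batch["action_logp"]
        if self.prev_logp is not None and self.prev_logp.shape == action_logp.shape:
            # Mean ratio between the stored log-probs and the current ones,
            # reported as a scalar so it can be logged like any other metric.
            result["logp_ratio"] = float(np.exp(self.prev_logp - action_logp).mean())
        self.prev_logp = action_logp.copy()

Would something like this be a reasonable way to do it, or is there a better mechanism for logging such values?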
Thanks,
Jakub