I found the recommended way to normalize rewards in Normalize reward, which is to use a callback function.
However, the execution sequence of PPO calls is as follows:
- postprocess_ppo_gae: computes the advantages (with or without GAE) and records them in the SampleBatch.
- on_postprocess_trajectory: callback called after a policy's postprocess_fn is called.
As we can see, if I normalize the rewards in on_postprocess_trajectory, the advantages have already been computed from the raw rewards, so the normalization does not affect the advantage calculation. It seems that modifying the reward values in the on_postprocess_trajectory callback therefore has no effect on the training results.
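To make this concrete, here is a minimal sketch of the callback approach I tried, assuming the DefaultCallbacks API from ray.rllib.agents.callbacks; the class name NormalizeRewardCallbacks and the per-batch standardization are just placeholders of mine:

```python
from ray.rllib.agents.callbacks import DefaultCallbacks
from ray.rllib.policy.sample_batch import SampleBatch


class NormalizeRewardCallbacks(DefaultCallbacks):
    def on_postprocess_trajectory(self, *, worker, episode, agent_id,
                                  policy_id, policies, postprocessed_batch,
                                  original_batches, **kwargs):
        # By the time this callback runs, postprocess_ppo_gae has already
        # executed, so the advantages in the batch were computed from the
        # raw (un-normalized) rewards.
        rewards = postprocessed_batch[SampleBatch.REWARDS]
        # Per-batch standardization, used here only as an example scheme.
        postprocessed_batch[SampleBatch.REWARDS] = (
            (rewards - rewards.mean()) / (rewards.std() + 1e-8))
```

I pass this class to the trainer via config["callbacks"] = NormalizeRewardCallbacks, but as described above the rewritten rewards never feed into the advantages.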
Alternatively, I can customize postprocess_ppo_gae and normalize the rewards inside it, so that the advantages are computed from the normalized rewards. Is this method officially recommended? Or is there a better way to normalize rewards before the advantages are calculated?
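Something like the following sketch is what I have in mind, assuming an RLlib version where postprocess_ppo_gae is importable from ray.rllib.agents.ppo.ppo_tf_policy and policies/trainers can be extended with with_updates; the function and class names are my own, and the normalization scheme is again just an example:

```python
from ray.rllib.agents.ppo import PPOTrainer
from ray.rllib.agents.ppo.ppo_tf_policy import PPOTFPolicy, postprocess_ppo_gae
from ray.rllib.policy.sample_batch import SampleBatch


def normalize_rewards_and_compute_gae(policy, sample_batch,
                                      other_agent_batches=None, episode=None):
    # Normalize the rewards first, so that the advantage computation
    # (GAE or plain returns) sees the normalized values.
    rewards = sample_batch[SampleBatch.REWARDS]
    sample_batch[SampleBatch.REWARDS] = (
        (rewards - rewards.mean()) / (rewards.std() + 1e-8))
    # Then run the built-in postprocessing on the modified batch.
    return postprocess_ppo_gae(policy, sample_batch,
                               other_agent_batches, episode)


NormalizedRewardPPOPolicy = PPOTFPolicy.with_updates(
    name="NormalizedRewardPPOPolicy",
    postprocess_fn=normalize_rewards_and_compute_gae,
)

NormalizedRewardPPOTrainer = PPOTrainer.with_updates(
    name="NormalizedRewardPPOTrainer",
    default_policy=NormalizedRewardPPOPolicy,
    get_policy_class=lambda config: NormalizedRewardPPOPolicy,
)
```

With this, the advantages are computed from the normalized rewards, but I am not sure whether overriding postprocess_fn like this is the intended approach.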