How to do reward normalization in RLlib's PPO

I found the recommended way to normalize rewards in Normalize reward. The recommended approach is to use a callback function, roughly as in the sketch below.

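Here is a minimal sketch of what I understand that to mean, written against the Ray 1.x-style callbacks API (the class name `NormalizeRewardCallbacks` and the simple per-batch standardization are my own illustration, not taken from that post):

```python
from ray.rllib.agents.callbacks import DefaultCallbacks
from ray.rllib.policy.sample_batch import SampleBatch


class NormalizeRewardCallbacks(DefaultCallbacks):
    def on_postprocess_trajectory(self, *, worker, episode, agent_id,
                                  policy_id, policies, postprocessed_batch,
                                  original_batches, **kwargs):
        # Standardize the rewards stored in the already post-processed batch.
        rewards = postprocessed_batch[SampleBatch.REWARDS]
        postprocessed_batch[SampleBatch.REWARDS] = (
            (rewards - rewards.mean()) / (rewards.std() + 1e-8))
```

This would then be enabled with `config["callbacks"] = NormalizeRewardCallbacks`.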
However, the execution sequence of PPO calls is as follows:

  1. postprocess_ppo_gae: computes the advantages (adv), with or without GAE, and records them in the SampleBatch.
  2. on_postprocess_trajectory: a callback invoked after a policy's postprocess_fn has been called.

As we can see, by the time on_postprocess_trajectory runs, the advantages have already been computed from the raw rewards, so normalizing the rewards there does not affect the adv calculation. Indeed, modifying the reward values in the on_postprocess_trajectory callback does not seem to change the training results.

Alternatively, I can customize postprocess_ppo_gae and normalize the rewards inside it, so that adv is computed from the normalized rewards (see the sketch below). Is this approach officially recommended, or is there a better way to normalize rewards before adv is calculated?
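To make that second option concrete, this is roughly what I have in mind, assuming the Ray 1.x build-a-policy API (`PPOTFPolicy.with_updates` / `PPOTrainer.with_updates`); the function name `normalize_then_gae` is just my placeholder, and the exact import paths differ between RLlib versions:

```python
from ray.rllib.agents.ppo import PPOTrainer
from ray.rllib.agents.ppo.ppo_tf_policy import PPOTFPolicy, postprocess_ppo_gae
from ray.rllib.policy.sample_batch import SampleBatch


def normalize_then_gae(policy, sample_batch, other_agent_batches=None,
                       episode=None):
    # Standardize the raw rewards first ...
    rewards = sample_batch[SampleBatch.REWARDS]
    sample_batch[SampleBatch.REWARDS] = (
        (rewards - rewards.mean()) / (rewards.std() + 1e-8))
    # ... then run the stock PPO postprocessing, so GAE sees the
    # normalized rewards.
    return postprocess_ppo_gae(policy, sample_batch,
                               other_agent_batches, episode)


MyPPOPolicy = PPOTFPolicy.with_updates(postprocess_fn=normalize_then_gae)
# Depending on the RLlib version, get_policy_class may also need overriding
# so the trainer actually picks up the custom policy.
MyPPOTrainer = PPOTrainer.with_updates(default_policy=MyPPOPolicy)
```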
