How to do reward normalization in RLlib's PPO

I want to normalize rewards in the on_sample_end(worker, samples) callback, but I found that advantage and target_value have already been calculated in the samples object. Does this mean that normalizing rewards in on_sample_end is meaningless? Since the advantages and target values have already been calculated, the rewards I modify do not affect the advantage calculation.

So what is the correct way to use reward normalization in RLlib? Do I need to recalculate and update the advantages and target values in samples after normalizing rewards in the on_sample_end() callback?

Any suggestion will be helpful. :grinning:


I found the recommended way to normalize rewards in "Normalize reward": use a callback function.

However, the execution sequence of PPO calls is as follows:

  1. postprocess_ppo_gae: calculates the advantages (adv), with or without GAE, and records them in the SampleBatch.
  2. on_postprocess_trajectory: callback invoked after the policy's postprocess_fn has been called.

As we can see, if I normalize the rewards in on_postprocess_trajectory, the adv has already been computed by that point, so it is not affected. It seems that modifying the reward values in the on_postprocess_trajectory callback does not affect the training results.
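The ordering issue can be reproduced without RLlib. The following is a minimal sketch, not RLlib's implementation: a plain dict stands in for the SampleBatch, and compute_gae and normalize are hypothetical helpers written for this illustration (the gamma and lam values are arbitrary). It shows that changing the rewards after the advantages are computed leaves them stale, and that recomputing both the advantages and the value targets is what makes the normalization take effect.

```python
from statistics import mean, pstdev


def compute_gae(rewards, vf_preds, last_value=0.0, gamma=0.99, lam=0.95):
    """Hypothetical helper: GAE advantages and value targets for one trajectory."""
    values = list(vf_preds) + [last_value]
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the discounted sum of later TD errors.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    value_targets = [a + v for a, v in zip(advantages, vf_preds)]
    return advantages, value_targets


def normalize(rewards, eps=1e-8):
    """Hypothetical helper: zero-mean, unit-std rewards within one batch."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A plain dict standing in for a sample batch (assumption, not an RLlib object).
batch = {"rewards": [1.0, 10.0, 100.0], "vf_preds": [0.5, 0.5, 0.5]}

# Step 1: postprocessing computes the advantages from the raw rewards.
batch["advantages"], batch["value_targets"] = compute_gae(
    batch["rewards"], batch["vf_preds"]
)
stale = list(batch["advantages"])

# Step 2: a later callback normalizes the rewards. The advantages do not change.
batch["rewards"] = normalize(batch["rewards"])
unchanged = batch["advantages"] == stale

# Fix: recompute advantages and value targets from the normalized rewards.
batch["advantages"], batch["value_targets"] = compute_gae(
    batch["rewards"], batch["vf_preds"]
)
changed = batch["advantages"] != stale
```

In this sketch, normalizing in the later step only matters if the advantage computation is rerun afterwards; otherwise the loss would still see values derived from the raw rewards.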

Alternatively, I could customize postprocess_ppo_gae and normalize the rewards inside it, so that the adv is then calculated from the normalized rewards. Is this method officially recommended? Or is there a better way for me to normalize rewards before the adv is calculated?
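If the rewards are normalized before the advantage calculation, per-batch statistics can be noisy, so one option is to keep running statistics across trajectories. Below is a minimal sketch using Welford's online algorithm. RunningRewardNormalizer is a hypothetical class written for this illustration, not part of RLlib; the idea is that its output would be passed to the advantage computation in place of the raw rewards.

```python
import math


class RunningRewardNormalizer:
    """Hypothetical helper: tracks running mean/std of rewards (Welford's algorithm)."""

    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.eps = eps

    def update(self, rewards):
        for r in rewards:
            self.count += 1
            delta = r - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (r - self.mean)

    @property
    def std(self):
        # Fall back to 1.0 until there are enough samples for a meaningful estimate.
        if self.count < 2:
            return 1.0
        return math.sqrt(self.m2 / self.count)

    def normalize(self, rewards):
        return [(r - self.mean) / (self.std + self.eps) for r in rewards]


normalizer = RunningRewardNormalizer()
for trajectory in ([1.0, 2.0, 3.0], [4.0, 5.0]):
    normalizer.update(trajectory)
    # These normalized rewards would be used before computing the advantages.
    normalized = normalizer.normalize(trajectory)
```

Note that subtracting the mean shifts the rewards, which can change the incentives in episodic tasks; dividing by the standard deviation alone is a common alternative.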

Hi! Thank you for sharing. I have a similar problem, and your ideas are helpful to me.