I want to normalize rewards in the on_sample_end(worker, samples) callback, but I found that the advantages and target values have already been calculated in the samples object by that point. Does this mean that normalizing rewards in on_sample_end is meaningless? Since the advantages and target values were already computed, the rewards I modified did not affect the advantage calculation.
So, what is the correct way to do reward normalization in RLlib? Do I need to recalculate and update the advantages and target values in samples after normalizing the rewards in on_sample_end?
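For reference, here is a minimal sketch of what I am considering, assuming the RLlib 1.x callbacks API, a single-agent SampleBatch, and GAE. The class name and the last_r=0.0 shortcut are my own simplifications, not something from the docs:

```python
from ray.rllib.agents.callbacks import DefaultCallbacks
from ray.rllib.evaluation.postprocessing import Postprocessing, compute_advantages
from ray.rllib.policy.sample_batch import SampleBatch


class RewardNormCallbacks(DefaultCallbacks):
    def on_sample_end(self, *, worker, samples, **kwargs):
        # Assumes a single-agent setup, so `samples` is a SampleBatch.
        policy = worker.policy_map["default_policy"]

        # Z-score normalize the rewards of this batch (a running mean/std
        # maintained across batches would be a common alternative).
        rewards = samples[SampleBatch.REWARDS]
        samples[SampleBatch.REWARDS] = (rewards - rewards.mean()) / (
            rewards.std() + 1e-8
        )

        # The advantages / value targets already in `samples` were computed
        # from the old rewards during postprocess_trajectory, so redo GAE
        # per episode using the normalized rewards.
        recomputed = SampleBatch.concat_samples([
            compute_advantages(
                episode,
                last_r=0.0,  # simplification: RLlib's own postprocessing
                             # bootstraps truncated episodes with v(s_last)
                gamma=policy.config["gamma"],
                lambda_=policy.config["lambda"],
                use_gae=policy.config["use_gae"],
            )
            for episode in samples.split_by_episode()
        ])
        samples[Postprocessing.ADVANTAGES] = recomputed[Postprocessing.ADVANTAGES]
        samples[Postprocessing.VALUE_TARGETS] = recomputed[Postprocessing.VALUE_TARGETS]
```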
Any suggestions would be helpful.