I want to normalize rewards in the on_sample_end(worker, samples) callback, but I found that the advantages and target values have already been calculated in the samples object by that point. Does this mean that normalizing rewards in on_sample_end is meaningless? Since the advantages and target values were already computed, the rewards I modified did not affect the advantage calculation.
So, what is the correct way to do reward normalization in RLlib? Do I need to recalculate and update the advantages and target values in samples after normalizing the rewards in on_sample_end?
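For reference, here is a minimal sketch of what I am considering, assuming the RLlib 1.x callbacks API, a single-agent SampleBatch, and GAE. The class name and the last_r=0.0 shortcut are my own simplifications, not something from the docs:

```python
from ray.rllib.agents.callbacks import DefaultCallbacks
from ray.rllib.evaluation.postprocessing import Postprocessing, compute_advantages
from ray.rllib.policy.sample_batch import SampleBatch


class RewardNormCallbacks(DefaultCallbacks):
    def on_sample_end(self, *, worker, samples, **kwargs):
        # Assumes a single-agent setup, so `samples` is a SampleBatch.
        policy = worker.policy_map["default_policy"]

        # Z-score normalize the rewards of this batch (a running mean/std
        # maintained across batches would be a common alternative).
        rewards = samples[SampleBatch.REWARDS]
        samples[SampleBatch.REWARDS] = (rewards - rewards.mean()) / (
            rewards.std() + 1e-8
        )

        # The advantages / value targets already in `samples` were computed
        # from the old rewards during postprocess_trajectory, so redo GAE
        # per episode using the normalized rewards.
        recomputed = SampleBatch.concat_samples([
            compute_advantages(
                episode,
                last_r=0.0,  # simplification: RLlib's own postprocessing
                             # bootstraps truncated episodes with v(s_last)
                gamma=policy.config["gamma"],
                lambda_=policy.config["lambda"],
                use_gae=policy.config["use_gae"],
            )
            for episode in samples.split_by_episode()
        ])
        samples[Postprocessing.ADVANTAGES] = recomputed[Postprocessing.ADVANTAGES]
        samples[Postprocessing.VALUE_TARGETS] = recomputed[Postprocessing.VALUE_TARGETS]
```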
Any suggestions would be helpful.