I want to normalize rewards in the on_sample_end(worker, samples) callback, but I found that the advantages and target values have already been calculated in the samples object by that point. Does this mean that normalizing rewards in on_sample_end is meaningless? Since the advantages and target values were already computed, the rewards I modified did not affect the advantage calculation.
So, what is the correct way to do reward normalization in RLlib? Do I need to recalculate and update the advantages and target values in samples after normalizing the rewards in on_sample_end?
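For reference, here is a minimal sketch of what I am considering, assuming the RLlib 1.x callbacks API, a single-agent SampleBatch, and GAE. The class name and the last_r=0.0 shortcut are my own simplifications, not something from the docs:

```python
from ray.rllib.agents.callbacks import DefaultCallbacks
from ray.rllib.evaluation.postprocessing import Postprocessing, compute_advantages
from ray.rllib.policy.sample_batch import SampleBatch


class RewardNormCallbacks(DefaultCallbacks):
    def on_sample_end(self, *, worker, samples, **kwargs):
        # Assumes a single-agent setup, so `samples` is a SampleBatch.
        policy = worker.policy_map["default_policy"]

        # Z-score normalize the rewards of this batch (a running mean/std
        # maintained across batches would be a common alternative).
        rewards = samples[SampleBatch.REWARDS]
        samples[SampleBatch.REWARDS] = (rewards - rewards.mean()) / (
            rewards.std() + 1e-8
        )

        # The advantages / value targets already in `samples` were computed
        # from the old rewards during postprocess_trajectory, so redo GAE
        # per episode using the normalized rewards.
        recomputed = SampleBatch.concat_samples([
            compute_advantages(
                episode,
                last_r=0.0,  # simplification: RLlib's own postprocessing
                             # bootstraps truncated episodes with v(s_last)
                gamma=policy.config["gamma"],
                lambda_=policy.config["lambda"],
                use_gae=policy.config["use_gae"],
            )
            for episode in samples.split_by_episode()
        ])
        samples[Postprocessing.ADVANTAGES] = recomputed[Postprocessing.ADVANTAGES]
        samples[Postprocessing.VALUE_TARGETS] = recomputed[Postprocessing.VALUE_TARGETS]
```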
Any suggestions would be helpful.