How to recompute the advantage during learning (PPO)

According to this paper, recomputing the advantage can improve performance.

Hi Sven, could you give me some hints on how to do that? @sven1977
The function compute_advantages looks relevant, but I am not sure where to call it.


Hey @Shanchao_Yang, thanks for your question. Allow me to kindly ask you to always direct your questions to the entire community here, as many users may know much better how to help you with your problem :slight_smile:
Advantage calculations usually happen in the “postprocess_trajectory” step. You can define a custom callback (example: ray/rllib/examples/) and override the on_postprocess_trajectory method to alter the batch that is about to be sent into your loss function and change the advantages therein.
Alternatively, you can build a new Policy class, e.g.:

MyNewPolicyCls = PPOTorchPolicy.with_updates(postprocess_fn=[your own postprocessing function])

This will give you a new PPO-style policy, but with your postprocessing function instead of the built-in RLlib one.
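
To make the "recompute" part more concrete, here is a minimal sketch of the GAE math that such a postprocessing function could re-run on fresh value predictions. It is plain Python and is not RLlib's own compute_advantages; the name recompute_gae and all of its parameters are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def recompute_gae(
    rewards: Sequence[float],
    values: Sequence[float],
    last_value: float,
    gamma: float = 0.99,
    lambda_: float = 0.95,
) -> Tuple[List[float], List[float]]:
    """Return (advantages, value_targets) for one trajectory fragment.

    Hypothetical helper, not part of RLlib.
    values[t] is the value estimate for the state at step t.
    last_value bootstraps the state after the final step
    (pass 0.0 if the episode terminated there).
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")

    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    # Walk backwards so each step can reuse the accumulated future advantage.
    for t in reversed(range(len(rewards))):
        # One-step TD error for step t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of this and all later TD errors.
        running = delta + gamma * lambda_ * running
        advantages[t] = running
        next_value = values[t]

    # Value targets are the advantages added back onto the value estimates.
    value_targets = [a + v for a, v in zip(advantages, values)]
    return advantages, value_targets
```

In a custom postprocessing function or callback, the idea would be to recompute the value predictions with the current value network, call a helper like this one, and write the results back into the batch's advantage and value-target columns before the loss sees them.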

Thanks for your suggestion. :slight_smile:

I searched the GitHub issue board and the docs before finding this. I would suggest increasing the visibility of this board in the docs, as the docs fill up the first pages of search results. Thanks for the solution!