Scaling advantage after rollout

How severely does this issue affect your experience of using Ray?

  • Low: It annoys or frustrates me for a moment.

I’m trying to use Ray for a research project with a custom environment. In earlier prototype experiments, I found it useful to scale the advantage function after gathering a rollout. For every state of my environment, I have an estimate of the maximum value obtainable from that state. Is there a way to scale the advantage by this value before the agent trains?

To make things more concrete, the advantage is usually computed as

A_0 = r_0 + g * r_1 + g^2 * r_2 + ...

and this advantage is used in algorithms like PPO during optimization. Suppose I know the maximum return at each time step: M_0, M_1, ... I would like to scale the advantages to A_0/M_0, A_1/M_1, ... before the optimization step. Is there a way to achieve this?
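To show exactly the transformation I have in mind, here is a minimal NumPy sketch (outside of Ray, assuming the per-step maximum returns M_t are available alongside the rollout; the function names are my own, not RLlib API):

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Compute A_t = r_t + g * r_{t+1} + g^2 * r_{t+2} + ...
    for every step t of a single rollout, by a backward pass."""
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def scale_advantages(advantages, max_returns):
    """Scale each advantage A_t by the known maximum return M_t
    of the state visited at step t (hypothetical helper)."""
    return np.asarray(advantages, dtype=np.float64) / np.asarray(max_returns, dtype=np.float64)

# Example rollout with rewards [1, 0, 2] and g = 0.9:
adv = discounted_returns(np.array([1.0, 0.0, 2.0]), gamma=0.9)
# adv = [2.62, 1.8, 2.0]
scaled = scale_advantages(adv, max_returns=[10.0, 10.0, 10.0])
# scaled = [0.262, 0.18, 0.2]
```

The question, then, is where in Ray's training loop I can apply something like `scale_advantages` to the batch after postprocessing but before the PPO update.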