How severely does this issue affect your experience of using Ray?
- Low: It annoys or frustrates me for a moment.
I’m trying to use Ray for a research project with a custom environment. In my previous prototype experiments, I’ve found it useful to scale the advantage function after gathering a rollout. For every state of my environment, I have an estimate of the maximum value that can be obtained from this state. Is there a way to scale the advantage function by this value before the agent trains?
To make things more concrete, the advantage is usually computed as,
A_0 = r_0 + g * r_1 + g^2 * r_2 + ...
and this advantage is used in algorithms like PPO during optimization. Suppose I know that the maximum return at each time step is M_0, M_1, ...
I would like to scale the advantage as A_0/M_0, A_1/M_1, ...
before the optimization step. Is there a way to achieve this?
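For concreteness, here is a rough sketch of what I imagine doing with RLlib's `on_postprocess_trajectory` callback hook (I'm not sure this is the right place to do it, and I'm assuming a recent Ray version where `DefaultCallbacks` lives under `ray.rllib.algorithms.callbacks`). The `max_return_for` helper is my own placeholder for the per-state estimate M_t, not anything from RLlib:

```python
import numpy as np
from ray.rllib.algorithms.callbacks import DefaultCallbacks
from ray.rllib.evaluation.postprocessing import Postprocessing


def max_return_for(obs):
    """Placeholder for my own per-state estimate M_t of the maximum
    achievable return from `obs`; in my setup this comes from the
    environment, it is not part of RLlib."""
    return 1.0


class ScaleAdvantages(DefaultCallbacks):
    """Divide each advantage A_t by my estimate M_t before the train step."""

    def on_postprocess_trajectory(
        self, *, worker, episode, agent_id, policy_id,
        policies, postprocessed_batch, original_batches, **kwargs
    ):
        # Compute M_t for every observation in the postprocessed batch.
        m = np.asarray([max_return_for(o) for o in postprocessed_batch["obs"]])
        # Rescale the advantages in place (guarding against division by zero).
        postprocessed_batch[Postprocessing.ADVANTAGES] = (
            postprocessed_batch[Postprocessing.ADVANTAGES] / np.maximum(m, 1e-8)
        )
```

I would then register this with `"callbacks": ScaleAdvantages` in the PPO config (or `config.callbacks(ScaleAdvantages)` with the newer config API). Does this hook actually run after GAE postprocessing but before PPO's optimization step, or is there a better-supported way to achieve this?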