I would like to know how to concat rollout batches before computing GAE.
I’m trying to change PPO to use the average reward setting instead of the discounted formulation.
In other words, I want to compute the TD-errors as
$$\delta_t = r_t - \bar{r} + v(s_{t+1}) - v(s_t),$$
where $\bar{r}$ is an estimation of the average reward for policy $\pi$, independent of the starting state $S_0$.
Everything else stays exactly the same.
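Concretely, the re-centered TD-errors could be computed like this (a small NumPy sketch with made-up data, not RLlib code):

```python
import numpy as np

# Hypothetical data for one rollout: rewards r_t and values v(s_t),
# including one bootstrap value at the end.
rewards = np.array([1.0, 0.0, 2.0])
values = np.array([0.5, 0.4, 0.6, 0.1])  # v(s_0), ..., v(s_3)

avg_r = rewards.mean()  # estimate of the average reward r_bar
# delta_t = r_t - r_bar + v(s_{t+1}) - v(s_t)
deltas = rewards - avg_r + values[1:] - values[:-1]
```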
In my understanding, the simplest way to implement it is to re-compute all rewards after rollout collection by subtracting the average of the collected rewards. In other words, I would like to perform the following steps:
1. perform rollouts (i.e., compute $\pi(a|s)$ and $v(s)$, and call `env.step()`)
2. concat batches
3. compute the average reward (e.g., $\bar{r} = \frac{1}{T}\sum_t r_t$)
4. re-compute rewards (e.g., $r_t \leftarrow r_t - \bar{r}$)
5. compute GAE as usual
6. backprop as usual
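To make the intended computation explicit, here is a rough NumPy sketch of steps 2–5 (all names are made up for illustration, not RLlib's API; I assume $\gamma = 1$ in the GAE recursion, as is usual in the average-reward setting):

```python
import numpy as np

def concat_and_compute_gae(reward_batches, value_batches, lam=0.95):
    # Steps 2-5: concat batches, estimate the average reward,
    # re-center the rewards, then run GAE. Illustrative helper, not RLlib API.
    # value_batches[i] holds v(s_0), ..., v(s_T) for rollout i, i.e. one
    # extra bootstrap value per batch; gamma is taken to be 1 here.
    rewards = np.concatenate(reward_batches)  # step 2: concat batches
    avg_r = rewards.mean()                    # step 3: average reward
    all_adv = []
    for r, v in zip(reward_batches, value_batches):
        # step 4: re-centered TD errors, delta_t = r_t - avg_r + v(s_{t+1}) - v(s_t)
        deltas = r - avg_r + v[1:] - v[:-1]
        # step 5: GAE backward pass (gamma = 1, so the decay factor is just lambda)
        adv = np.zeros_like(deltas)
        gae = 0.0
        for t in reversed(range(len(deltas))):
            gae = deltas[t] + lam * gae
            adv[t] = gae
        all_adv.append(adv)
    return np.concatenate(all_adv)
```

The key point is that the average `avg_r` is taken over *all* concatenated batches before any per-batch advantage is computed, which is exactly the ordering I cannot get from the default pipeline.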
My issue is that, by construction, step 5 occurs before step 2, and I don't see any way to reverse them. How can I implement such an algorithm with RLlib? Is there any way to override PPO's default behavior?