where the subtracted term is an estimate of the average reward for policy $\pi$, independent of the starting state $S_0$.
Everything else stays exactly the same.
In my understanding, the simplest way to implement this is to re-compute all rewards after rollout collection by subtracting the average of the collected rewards. In other words, I would like to perform the following:
1. perform rollouts (e.g., compute $\pi(a|s)$, $v(s)$, and call `env.step()`)
2. concat batches
3. compute average reward (e.g., $\bar{r} = \frac{1}{N}\sum_{t} r_t$)
4. re-compute rewards (e.g., $r_t \leftarrow r_t - \bar{r}$)
5. compute GAE as usual
6. backprop as usual
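The intended pipeline (steps 2–5) can be sketched in plain NumPy. This is a toy illustration, not RLlib code: `compute_gae` below is a textbook GAE implementation that ignores episode boundaries, and the worker batches are made-up data.

```python
import numpy as np

def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Textbook GAE over a single trajectory (episode boundaries ignored)."""
    values_ext = np.append(values, last_value)
    deltas = rewards + gamma * values_ext[1:] - values_ext[:-1]
    advantages = np.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        gae = deltas[t] + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Toy per-worker batches standing in for step 1's rollouts.
worker_rewards = [np.array([1.0, 2.0]), np.array([3.0, 2.0])]
worker_values = [np.array([0.5, 0.4]), np.array([0.6, 0.5])]

all_rewards = np.concatenate(worker_rewards)   # step 2: concat batches
avg_reward = all_rewards.mean()                # step 3: average reward
centered = all_rewards - avg_reward            # step 4: re-compute rewards
advantages = compute_gae(centered,             # step 5: GAE on centered rewards
                         np.concatenate(worker_values), last_value=0.0)
```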
My issue is that, by construction, step 5 occurs before step 2, and I don’t see any way to reverse them. How can I implement such an algorithm with RLlib? Is there any way to override PPO’s default behavior?
Before passing the training batch to the loss function, you have to postprocess it. I’m pretty sure you will have value-function predictions in the batch because they are computed in the policy code for both the PPO Torch and TensorFlow policies. (There you can add a new key to the train-batch dict.)
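A minimal sketch of that idea, using a plain dict as a stand-in for RLlib's `SampleBatch` (the `"rewards"` key mirrors RLlib's field name, but `add_centered_rewards` and the `"centered_rewards"` key are hypothetical names, not RLlib API):

```python
import numpy as np

def add_centered_rewards(train_batch):
    # Hypothetical postprocessing step: add a new key holding the
    # mean-subtracted rewards before the batch reaches the loss function.
    rewards = np.asarray(train_batch["rewards"], dtype=np.float64)
    train_batch["centered_rewards"] = rewards - rewards.mean()
    return train_batch

batch = {"rewards": [1.0, 2.0, 3.0]}
batch = add_centered_rewards(batch)
```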
As I explained in the question, the problem with this strategy is that not all data is provided to postprocess_fn. The SampleBatch input to postprocess_fn contains only the data from the current worker. I would like to have access to the data from all rollout workers to compute the average reward, before computing GAE.
Well, this option did cross my mind, and indeed it seems feasible. The only thing I don’t like about it is that GAE will be performed twice (before applying the average reward and then a second time after it).
I was reading the code, and I believe that the only way to work around this double computation is to override the postprocess_fn as well (so it does not do any computation on the first pass over the data) and then compute GAE later, after batch concatenation.
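A rough sketch of that workaround, with plain dicts standing in for RLlib's `SampleBatch` (the `"rewards"`/`"vf_preds"` keys mimic RLlib's field names, but the functions are hypothetical; episode boundaries and bootstrap values are ignored for brevity):

```python
import numpy as np

def noop_postprocess(batch):
    # First pass (per worker): skip GAE entirely and return the batch as-is.
    return batch

def deferred_centered_gae(batches, gamma=0.99, lam=0.95):
    # Second pass (after concatenation): subtract the global average reward,
    # then compute GAE exactly once.
    rewards = np.concatenate([b["rewards"] for b in batches])
    values = np.concatenate([b["vf_preds"] for b in batches])
    rewards = rewards - rewards.mean()
    values_ext = np.append(values, 0.0)  # assume terminal value 0 for brevity
    deltas = rewards + gamma * values_ext[1:] - values_ext[:-1]
    advantages = np.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        gae = deltas[t] + gamma * lam * gae
        advantages[t] = gae
    return advantages

batches = [noop_postprocess({"rewards": np.array([1.0, 3.0]),
                             "vf_preds": np.array([0.5, 0.5])}),
           noop_postprocess({"rewards": np.array([2.0, 2.0]),
                             "vf_preds": np.array([0.5, 0.5])})]
adv = deferred_centered_gae(batches)
```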