I am struggling a little bit with the role of gamma in policy gradient algorithms like Vanilla Policy Gradient or PPO.
Most derivations of the policy gradient theorem that I have read assume no discounting (gamma = 1). In the episodic case this is no problem, but they make the same assumption for the continuing case, where they reformulate the value function in terms of the average reward.
As I understand it, the implementation in RLlib follows the derivation of the policy gradient theorem without discounting. That would mean the objective the algorithm maximizes is the undiscounted sum of rewards (or the average reward in the continuing case), even if gamma < 1 is used to calculate the advantage A.
    # Calculate the vanilla PG loss based on:
    # L = -E[ log(pi(a|s)) * A]
    log_probs = action_dist.logp(train_batch[SampleBatch.ACTIONS])

    # Save the loss in the policy object for the stats_fn below.
    policy.pi_err = -torch.mean(
        log_probs * train_batch[Postprocessing.ADVANTAGES])
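To make concrete what I mean, here is a minimal pure-Python sketch of how I picture it. The helper names `discounted_returns` and `pg_loss` are my own and are not RLlib functions, and I use the plain reward-to-go as a stand-in for the advantage. The point is that gamma only enters through the advantage estimate, while the loss itself carries no gamma**t weight per time step.

```python
def discounted_returns(rewards, gamma):
    """Reward-to-go G_t = r_t + gamma * G_{t+1}, computed backwards in time."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def pg_loss(log_probs, advantages):
    """L = -mean(log pi(a_t|s_t) * A_t). Note: no gamma**t factor per step."""
    terms = [lp * adv for lp, adv in zip(log_probs, advantages)]
    return -sum(terms) / len(terms)


rewards = [1.0, 1.0, 1.0]
print(discounted_returns(rewards, gamma=1.0))  # [3.0, 2.0, 1.0]
print(discounted_returns(rewards, gamma=0.5))  # [1.75, 1.5, 1.0]
```

So gamma changes the numbers that multiply the log-probabilities, but every time step contributes to the loss with the same weight.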
If so, gamma is not part of the problem specification but a parameter for adjusting a bias-variance trade-off, as Schulman describes in his paper "High-Dimensional Continuous Control Using Generalized Advantage Estimation".
Is that right? If not, what am I misunderstanding? Any help is highly appreciated.
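For reference, this is how I read the advantage estimator from that paper, as a rough pure-Python sketch. The function name `gae_advantages` and the `last_value` bootstrap argument are my own choices for illustration, not RLlib's API. Both gamma and lambda appear only here, inside the estimator.

```python
def gae_advantages(rewards, values, gamma, lam, last_value=0.0):
    """Generalized advantage estimates for one trajectory.

    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    A_t     = delta_t + gamma * lam * A_{t+1}
    last_value is the value estimate after the final step (0.0 if terminal).
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages


rewards = [1.0, 1.0, 1.0]
# With a zero value function and lam=1 this reduces to the discounted reward-to-go.
print(gae_advantages(rewards, [0.0, 0.0, 0.0], gamma=0.5, lam=1.0))  # [1.75, 1.5, 1.0]
# With lam=0 each advantage is just the one-step TD error.
print(gae_advantages(rewards, [0.5, 0.5, 0.5], gamma=1.0, lam=0.0))  # [1.0, 1.0, 0.5]
```

Lowering gamma or lambda shortens the horizon over which rewards are accumulated, which is what I take the bias-variance trade-off to refer to.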