The role of the discount factor gamma in policy gradient algorithms

Hello everyone,

I am struggling a little bit with the role of gamma in policy gradient algorithms like Vanilla Policy Gradient or PPO.
Most derivations of the policy gradient theorem that I have read assume no discounting (gamma = 1), both in the episodic case, where this is unproblematic, and in the continuing case. In the continuing case, they reformulate the value function in terms of the average reward.
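
To make that concrete, the two objectives I have in mind look roughly like this (my own transcription in standard notation, so details may differ from the papers):

% Episodic case, no discounting:
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} r_t \right]

% Continuing case, reformulated as the average reward:
J_{\mathrm{avg}}(\theta) = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T-1} r_t \right]

% Policy gradient theorem (schematically, for either objective):
\nabla_\theta J(\theta) = \mathbb{E}\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^{\pi_\theta}(s_t, a_t) \right]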

In my understanding, the RLlib implementation here follows the derivation of the policy gradient theorem without discounting. That means the objective the algorithm is maximizing is the undiscounted sum of rewards (or the average reward in the continuing case), even if gamma < 1 is used to calculate the advantage A.

RLlib implementation:

# Calculate the vanilla PG loss based on:
# L = -E[ log(pi(a|s)) * A]
log_probs = action_dist.logp(train_batch[SampleBatch.ACTIONS])

# Save the loss in the policy object for the stats_fn below.
policy.pi_err = -torch.mean(
    log_probs * train_batch[Postprocessing.ADVANTAGES])

In my understanding, gamma is then not part of the problem specification but a parameter for adjusting a bias-variance trade-off, as Schulman et al. write in the paper "High-Dimensional Continuous Control Using Generalized Advantage Estimation".
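
For reference, the advantage estimator from that paper (again my own transcription) makes the roles of gamma and lambda explicit; both appear only in the estimator, not in the MDP definition:

% TD residual:
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)

% GAE advantage:
\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}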

Is that right? Or what am I misunderstanding? Any help is highly appreciated.

Also interested in this. In my own experiments, I think I can confirm that PPO is effectively discounting future rewards with gamma when optimizing: on a dynamic problem, PPO converges to the exact solution of the discounted objective. Nevertheless, I had the same doubts in the beginning, and the fact that the discounted-reward calculations are not directly accessible made me doubt it (I created a callback that calculates the discounted return at the end of each episode).
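
For what it's worth, the check my callback does at the end of an episode is essentially the following (a standalone sketch, not the actual callback code; discounted_return is just an illustrative name):

def discounted_return(rewards, gamma):
    # Sum of gamma**t * r_t over one finished episode,
    # accumulated backwards so each reward is discounted once per step.
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: gamma = 0.99 and rewards [1.0, 1.0, 1.0]
# gives 1.0 + 0.99 * 1.0 + 0.99**2 * 1.0 = 2.9701.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.99))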


PPO, which uses GAE, calculates the advantages in the postprocessing function here: ray/postprocessing.py at 81b052f222f0ab8d516e7c7769f573f7111edf5b · ray-project/ray · GitHub

This includes both gamma and lambda.
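
Conceptually, that postprocessing computes something like the following (a simplified GAE sketch, not the actual RLlib code; the function name, signature, and defaults are only for illustration):

import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    # values are the per-step value predictions V(s_t);
    # last_value is the bootstrap value for the state after the last step.
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.append(np.asarray(values, dtype=np.float64), last_value)

    # TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    deltas = rewards + gamma * values[1:] - values[:-1]

    # Backward recursion: A_t = delta_t + gamma * lambda * A_{t+1}
    advantages = np.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        gae = deltas[t] + gamma * lam * gae
        advantages[t] = gae

    # Value targets for the critic: A_t + V(s_t)
    return advantages, advantages + values[:-1]

With lambda = 1 this reduces to the discounted Monte Carlo advantage, and with lambda = 0 to the one-step TD residual, which is the bias-variance trade-off discussed in the GAE paper.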