The role of the discount factor gamma in policy gradient algorithms

thgehr · September 5, 2021, 7:16pm

Hello everyone,

I am struggling a little bit with the role of gamma in policy gradient algorithms like Vanilla Policy Gradient or PPO.
Most derivations of the policy gradient theorem that I have read assume no discounting (gamma = 1), both in the episodic case, where this is no problem, and in the continuing case, where they reformulate the value function in terms of the average reward.

In my understanding, the implementation in RLlib follows the derivation of the policy gradient theorem with no discounting. That would mean the objective the algorithm maximizes is the undiscounted sum of rewards (or the average reward in the continuing case), even if gamma < 1 is used to calculate the advantage A.
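To make the distinction concrete, here is one way to write the two quantities being compared. The symbols (J_gamma, A^gamma, tau, g-hat) are notation assumed for this note and are not taken from the RLlib source. The exact gradient of the discounted objective carries a gamma^t weight on each time step, while a loss of the form L = -E[log(pi(a|s)) * A] has no such weight:

```latex
% Discounted objective and its exact gradient (note the \gamma^t weight):
J_\gamma(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_t\Big],
\qquad
\nabla_\theta J_\gamma(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t=0}^{\infty} \gamma^{t}\,
    \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^{\gamma}(s_t, a_t)\Big]

% Estimator corresponding to a loss -E[log pi(a|s) * A] (no \gamma^t weight):
\hat{g} = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t}
    \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}^{\gamma}(s_t, a_t)\Big]
```

With gamma = 1 the two expressions coincide. With gamma < 1 they differ by the per-step weight, so the unweighted estimator is, strictly speaking, not the gradient of the discounted objective, even though the advantages inside it are discounted.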

RLlib implementation:

# Calculate the vanilla PG loss based on:
# L = -E[ log(pi(a|s)) * A]
log_probs = action_dist.logp(train_batch[SampleBatch.ACTIONS])

# Save the loss in the policy object for the stats_fn below.
policy.pi_err = -torch.mean(
    log_probs * train_batch[Postprocessing.ADVANTAGES])

In my understanding, gamma is then not part of the problem specification but a parameter for adjusting a bias-variance trade-off, as Schulman et al. write in their paper "High-Dimensional Continuous Control Using Generalized Advantage Estimation".

Is that right? Or what am I misunderstanding? Any help is highly appreciated.

MatiasCova · September 28, 2021, 6:07pm

I'm also interested in this. From my own experiments, I think I can confirm that PPO effectively discounts future rewards with gamma when optimizing. I say this because, on a dynamic problem, PPO converges to the exact solution of the discounted problem. Nevertheless, I had the same doubts in the beginning, and the fact that the discounted reward calculations are not directly accessible added to them (I ended up creating a callback that computes the discounted rewards at the end of each episode).
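The end-of-episode calculation mentioned above can be sketched in plain Python. This is only an illustration: `discounted_return` is a hypothetical helper name, the rewards are assumed to be available as a list for one finished episode, and how it gets wired into an RLlib callback is deliberately left out.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for one finished episode.

    Hypothetical helper for illustration; not part of RLlib.
    """
    total = 0.0
    # Accumulate backwards: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    episode_rewards = [1.0, 1.0, 1.0]
    print(discounted_return(episode_rewards, gamma=0.5))  # 1 + 0.5 + 0.25
    print(discounted_return(episode_rewards, gamma=1.0))  # plain sum
```

Comparing this number with the undiscounted episode reward is one way to check which of the two quantities a trained policy actually favors.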

michaelzhiluo · September 30, 2021, 11:27pm

PPO, which uses GAE, calculates the advantages in the postprocessing function here: ray/postprocessing.py at commit 81b052f222f0ab8d516e7c7769f573f7111edf5b (ray-project/ray on GitHub)

This includes both gamma and lambda.
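For readers who want to see where gamma and lambda enter, here is a minimal sketch of the standard GAE recursion. It is not the code from the linked file; the function name `gae_advantages` and its arguments are assumptions made for this example.

```python
def gae_advantages(rewards, values, last_value, gamma, lam):
    """Generalized Advantage Estimation for a single trajectory.

    rewards[t]  : reward received after acting in state s_t
    values[t]   : value estimate V(s_t)
    last_value  : bootstrap value V(s_T); use 0.0 if the episode terminated
    Hypothetical helper for illustration; not RLlib's implementation.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # A_t = delta_t + (gamma * lam) * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


if __name__ == "__main__":
    rewards = [1.0, 1.0, 1.0]
    values = [0.0, 0.0, 0.0]
    print(gae_advantages(rewards, values, 0.0, gamma=0.5, lam=1.0))
    print(gae_advantages(rewards, values, 0.0, gamma=0.5, lam=0.0))
```

In this recursion, gamma appears both in the TD error and in the decay term, while lambda appears only in the decay term. With lam = 1 the estimate reduces to the discounted return minus the value baseline, and with lam = 0 it reduces to the one-step TD error.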
