PPO Reward Scaling

Denys_Ashikhin · September 2, 2021, 3:24pm

Hi All,

I have a question about how large rewards should be. My base reward is currently 1000 (firstPlace in the code below), and every reward or punishment, whether per step or at the very end of the episode, is calculated as a fraction of that amount.

For example:

reward = 0

# Each punishment flag is cleared once it has been applied.
if self.tinyPunish:
    self.tinyPunish = False
    reward -= firstPlace * 0.0001
if self.smallPunish:
    self.smallPunish = False
    reward -= firstPlace * 0.001
if self.mediumPunish:
    self.mediumPunish = False
    reward -= firstPlace * 0.01
if self.strongPunish:
    self.strongPunish = False
    reward -= firstPlace * 0.1

…a bit further down…

if tieredUp == 10:
    reward += firstPlace * 0.02
elif tieredUp == 11:
    reward += firstPlace * 0.08

if self.leveledUp:
    # don't want to reward for rushing early levels as I think that's just dumb
    if (self.level > 4) and ((self.boardUnitCount() + 1) >= self.level):
        # Reward for getting to level 5: 12.5
        # Reward for getting to level 6: 21.6
        # Reward for getting to level 7: 34.3
        # Reward for getting to level 8: 51.2
        # Reward for getting to level 9: 72.9
        # Reward for getting to level 10: 100.0
        award = firstPlace * 0.0001 * (self.level ** 3)
        print(f"Awarded: {award} for leveling up with: {self.boardUnitCount()} heroes!")
        reward += award
        self.leveledUp = False

Is that okay, or do I need to scale everything to between 0 and 1 per step, or make sure the total reward doesn't exceed 1 per episode?
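
As a quick sanity check on the magnitudes involved, here is a minimal sketch that reproduces the level-up awards listed above and shows what they become when divided by the base reward. It assumes firstPlace is 1000, which matches the listed awards. The names FIRST_PLACE, level_up_award and scaled are hypothetical helpers for illustration, not part of the environment.

```python
FIRST_PLACE = 1000  # assumed base reward; consistent with the awards listed in the post


def level_up_award(level, first_place=FIRST_PLACE):
    """Level-up award from the post: firstPlace * 0.0001 * level ** 3."""
    return first_place * 0.0001 * level ** 3


def scaled(reward, first_place=FIRST_PLACE):
    """Hypothetical rescaling: express a reward as a fraction of the base reward."""
    return reward / first_place


# Awards for levels 5 through 10, rounded to one decimal as in the post.
awards = {level: round(level_up_award(level), 1) for level in range(5, 11)}

for level, award in awards.items():
    print(f"level {level}: raw {award}, scaled {scaled(award):.4f}")
```

Dividing by the base reward keeps every per-step term at or below 0.1 in magnitude and the final reward at 1, without changing the relative sizes of the terms.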

michaelzhiluo · September 2, 2021, 5:33pm

It is fine for rewards to exceed 1 per episode. For reference, the MuJoCo environments can have pretty large rewards, and those are passed into the policy loss function as they are.

Denys_Ashikhin · September 3, 2021, 12:36pm

For the PPO parameters there is this:

# PPO clip parameter.
"clip_param": 0.3,
# Clip param for the value function. Note that this is sensitive to the
# scale of the rewards. If your expected V is large, increase this.
"vf_clip_param": 100000.0

What would count as large? And by how much would I need to increase it?
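
One way to get a feel for the question is to compare the squared value error with the clip. The sketch below assumes the clip acts as a cap on the squared error between the predicted value and the return target. That is an assumption about the form of the loss for illustration, not a quote of RLlib's code, and the function name clipped_vf_loss and the sample numbers are hypothetical.

```python
def clipped_vf_loss(value_pred, value_target, vf_clip_param):
    """Squared value error, capped at vf_clip_param (assumed loss form)."""
    squared_error = (value_pred - value_target) ** 2
    return min(squared_error, vf_clip_param)


# Returns on the order of 1: the error stays below a clip of 10.
small_returns = clipped_vf_loss(0.0, 2.0, vf_clip_param=10.0)

# Returns on the order of 1000: the squared error is 1e6, so a clip of 10 saturates.
large_returns_tight_clip = clipped_vf_loss(0.0, 1000.0, vf_clip_param=10.0)

# The same error against the clip from the post (1e5) is still capped.
large_returns_post_clip = clipped_vf_loss(0.0, 1000.0, vf_clip_param=100000.0)

# Only a clip above the squared error (here 1e7) leaves it untouched.
large_returns_wide_clip = clipped_vf_loss(0.0, 1000.0, vf_clip_param=1e7)

print(small_returns, large_returns_tight_clip,
      large_returns_post_clip, large_returns_wide_clip)
```

Under this assumption, "large" is relative to the square of the typical gap between predicted and actual returns: once most errors sit at the cap, the value loss no longer distinguishes bad predictions from very bad ones.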
