PPO Reward Scaling

Hi All,

I have a question about how big the rewards should be. I currently have a first-place reward of 1000 (firstPlace in the code below), and all punishments and rewards (per step and at the very end) are calculated as fractions of that amount.

For example:

        reward = 0

        if self.tinyPunish:
            self.tinyPunish = False
            reward -= firstPlace * 0.0001
        if self.smallPunish:
            self.smallPunish = False
            reward -= firstPlace * 0.001
        if self.mediumPunish:
            self.mediumPunish = False
            reward -= firstPlace * 0.01
        if self.strongPunish:
            self.strongPunish = False
            reward -= firstPlace * 0.1

…a bit further down…

        if tieredUp == 10:
            reward += firstPlace * 0.02
        elif tieredUp == 11:
            reward += firstPlace * 0.08

        if self.leveledUp:

            if (self.level > 4) and ((self.boardUnitCount() + 1) >= self.level):  # don't want to reward for rushing early levels as I think that's just dumb

                """
                Reward for getting to level: 5: 12.5
                Reward for getting to level: 6: 21.6
                Reward for getting to level: 7: 34.3
                Reward for getting to level: 8: 51.2
                Reward for getting to level: 9: 72.9
                Reward for getting to level: 10: 100.0
                """
                award = firstPlace * 0.0001 * (self.level ** 3)
                print(f"Awarded: {award} for leveling up with: {self.boardUnitCount()} heroes!")
                reward += award
                self.leveledUp = False

I was wondering if that is okay, or do I need to scale everything between 0 and 1 per step? Or make sure that rewards don’t exceed 1 per episode?

It is fine for rewards to exceed 1 per episode. For reference, the MuJoCo environments can have pretty large rewards, which are passed into the policy loss function.
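
If you would still prefer to keep rewards near unit scale, one option is to divide everything by a single constant. Below is a minimal sketch, assuming firstPlace is 1000 as in the question; `FIRST_PLACE` and `scale_reward` are hypothetical names for illustration, not part of any library. Dividing every reward by the same positive constant keeps the relative sizes unchanged, so only the magnitude that the value function has to fit is different.

```python
FIRST_PLACE = 1000.0  # assumed value of firstPlace, taken from the question


def scale_reward(raw_reward: float, scale: float = FIRST_PLACE) -> float:
    """Express a raw reward as a fraction of the first-place reward."""
    return raw_reward / scale


# Raw values computed with the same multipliers as the snippets in the question.
raw = {
    "tiny punish": -FIRST_PLACE * 0.0001,
    "strong punish": -FIRST_PLACE * 0.1,
    "level 10 bonus": FIRST_PLACE * 0.0001 * 10 ** 3,
}

# Same rewards after scaling; ratios between them are unchanged.
scaled = {name: scale_reward(value) for name, value in raw.items()}

for name, value in scaled.items():
    print(f"{name}: raw={raw[name]:.4f} scaled={value:.6f}")
```

With this scaling the first-place reward becomes 1.0 and everything else is a fraction of it.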

For the PPO parameters there is this:

    # PPO clip parameter.
    "clip_param": 0.3,
    # Clip param for the value function. Note that this is sensitive to the
    # scale of the rewards. If your expected V is large, increase this.
    "vf_clip_param": 100000.0

What would count as large here? And by how much would I need to increase vf_clip_param?
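
One rough way to get a feel for the scale of V is to look at discounted returns, since those are what the value function is trained to predict. The sketch below assumes a discount factor of 0.99 and a made-up episode (50 steps with the tiny punishment, then the first-place reward of 1000); `discounted_returns` is a hypothetical helper written for this example, not a library function.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return the discounted return G_t for every step of one episode."""
    returns = []
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


FIRST_PLACE = 1000.0  # assumed value of firstPlace, taken from the question

# Hypothetical episode: 50 tiny punishments followed by the first-place reward.
episode = [-FIRST_PLACE * 0.0001] * 50 + [FIRST_PLACE]

returns = discounted_returns(episode)
largest = max(abs(g) for g in returns)
print(f"largest |return| in this episode: {largest:.1f}")
```

If typical returns come out in the hundreds or thousands, value prediction errors early in training can be of a similar order, and that is the scale to compare vf_clip_param against. How the clipping is applied can differ between library versions, so treat this as an order-of-magnitude check only.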