I think there is an error in the PPO clipped surrogate loss and would appreciate it if someone else could take a look and let me know if I am missing something. This affects both the tf and torch versions.
If you look at the code snippet above from master, you will see that line 139 computes the mean of the negative surrogate loss. But on line 175, when combining the surrogate, vf, and entropy losses into total_loss, the surrogate loss is negated again, which I think is incorrect.
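To make the concern concrete, here is roughly the pattern in question, paraphrased from memory rather than copied verbatim (the stand-in tensors, coefficient names, and the plain-mean version of `reduce_mean_valid` are my approximations, not the actual file contents):

```python
import torch

# Stand-ins for the quantities in the snippet (shapes/values are arbitrary).
surrogate_loss = torch.randn(8)   # per-sample clipped PPO objective
vf_loss = torch.randn(8)
curr_entropy = torch.randn(8)
vf_loss_coeff, entropy_coeff = 0.5, 0.01
reduce_mean_valid = torch.mean    # a masked mean in the real code, plain mean here

# ~line 139: first negation, reduced to a scalar.
mean_policy_loss = reduce_mean_valid(-surrogate_loss)

# ~line 175: surrogate_loss appears negated again when building total_loss.
# My worry was that this amounted to a double negation of the same quantity.
total_loss = reduce_mean_valid(
    -surrogate_loss
    + vf_loss_coeff * vf_loss
    - entropy_coeff * curr_entropy
)
```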
I checked the cleanrl implementation and observed that they compute it as I expected, with only a single negation.
As does stable baselines:
And, although it is harder to trace, so does seedrl (I think):
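For reference, here is a minimal sketch of the single-negation pattern I expected, in the style of those implementations (the function and variable names are mine, not copied from any of the repos):

```python
import torch

def clipped_surrogate_loss(ratio, advantages, clip_eps=0.2):
    """PPO clipped policy loss with a single negation (cleanrl-style sketch)."""
    # Per-sample clipped objective; PPO maximizes this quantity.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Negate exactly once to turn the objective into a loss to minimize.
    return -torch.min(unclipped, clipped).mean()
```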
@Lars_Simon_Zehnder pointed out that mean_policy_loss is used for metrics and surrogate_loss is used for the loss.
So I think it is OK.
What threw me off is that most libraries I looked at take the mean of each loss component separately and then combine them, whereas rllib combines the per-sample terms first and then takes a single mean. So I was expecting mean_policy_loss to be the one used in total_loss, and I did not read carefully enough.
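For what it is worth, the two styles give the same result as long as every component is averaged over the same elements with the same weights, since the mean is linear. A quick sanity check (hypothetical tensors, coefficients chosen arbitrarily):

```python
import torch

torch.manual_seed(0)
pi_loss = torch.randn(8)   # per-sample negated surrogate terms
vf_loss = torch.randn(8)   # per-sample value-function errors
entropy = torch.randn(8)   # per-sample entropies
vf_coeff, ent_coeff = 0.5, 0.01

# Style A: mean each component, then combine (cleanrl / stable baselines style).
a = pi_loss.mean() + vf_coeff * vf_loss.mean() - ent_coeff * entropy.mean()

# Style B: combine per-sample terms, then take one mean (rllib style).
b = (pi_loss + vf_coeff * vf_loss - ent_coeff * entropy).mean()

assert torch.allclose(a, b)  # linearity of the mean: both give the same loss
```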
Why is the mean re-computed in total_loss, though? Perhaps it is required for multi-GPU cases?