I’m a little bit confused about this, and perhaps it is my understanding of the theory that isn’t right.
When I read the implementation of ppo_surrogate_loss:
```python
def ppo_surrogate_loss(
        policy: Policy, model: ModelV2, dist_class: Type[TFActionDistribution],
        train_batch: SampleBatch) -> Union[TensorType, List[TensorType]]:
    # ...
    if policy.config["use_gae"]:
        # ...
        total_loss = reduce_mean_valid(
            -surrogate_loss +
            policy.kl_coeff * action_kl +
            policy.config["vf_loss_coeff"] * vf_loss -
            policy.entropy_coeff * curr_entropy)
        # ...

    # Store stats in policy for stats_fn.
    policy._total_loss = total_loss
    policy._mean_policy_loss = mean_policy_loss
    policy._mean_vf_loss = mean_vf_loss
    policy._mean_entropy = mean_entropy
    policy._mean_kl = mean_kl

    return total_loss
```
I am surprised to see that the computation of the loss function is independent of vf_share_layers. If the policy and value networks are fully independent, shouldn’t each be trained with its own loss (i.e., -surrogate_loss + policy.kl_coeff * action_kl for the policy and vf_loss for the value function, respectively)? Then again, I may have misunderstood the Proximal Policy Optimization Algorithms paper.
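To make the question concrete, here is a toy sketch of what I expected for the vf_share_layers=False case. This is plain TensorFlow, not RLlib code, and the two losses are made-up stand-ins for the real PPO terms; the point is just that each independent network would only ever receive gradients from its own loss:

```python
import tensorflow as tf

# Toy sketch, NOT RLlib code: two fully independent networks, each with its
# own loss and its own gradients. The losses below are stand-ins for the real
# PPO terms (-surrogate_loss + KL penalty - entropy bonus for the policy, and
# the value-function error for the critic).
policy_net = tf.keras.Sequential([tf.keras.layers.Dense(2)])
value_net = tf.keras.Sequential([tf.keras.layers.Dense(1)])

obs = tf.random.normal([8, 4])
value_targets = tf.random.normal([8, 1])

with tf.GradientTape(persistent=True) as tape:
    logits = policy_net(obs)                  # stand-in policy head
    values = value_net(obs)                   # stand-in value head
    policy_loss = -tf.reduce_mean(logits)     # stand-in for the policy terms
    vf_loss = tf.reduce_mean(tf.square(values - value_targets))

# Each network is updated only from "its" loss.
policy_grads = tape.gradient(policy_loss, policy_net.trainable_variables)
vf_grads = tape.gradient(vf_loss, value_net.trainable_variables)
del tape
```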