I’m a little confused about this; perhaps my understanding of the theory is what’s wrong.

When I read the implementation of

```python
def ppo_surrogate_loss(
        policy: Policy, model: ModelV2, dist_class: Type[TFActionDistribution],
        train_batch: SampleBatch) -> Union[TensorType, List[TensorType]]:
    # ...
    if policy.config["use_gae"]:
        # ...
        total_loss = reduce_mean_valid(
            -surrogate_loss + policy.kl_coeff * action_kl +
            policy.config["vf_loss_coeff"] * vf_loss -
            policy.entropy_coeff * curr_entropy)
    # ...
    # Store stats in policy for stats_fn.
    policy._total_loss = total_loss
    policy._mean_policy_loss = mean_policy_loss
    policy._mean_vf_loss = mean_vf_loss
    policy._mean_entropy = mean_entropy
    policy._mean_kl = mean_kl
    return total_loss
```

I am surprised to see that the computation of the loss function is independent of `vf_share_layers`. If the policy and value networks are fully independent, shouldn’t each be trained with its own loss (i.e., `-surrogate_loss + policy.kl_coeff * action_kl` for the policy and `vf_loss` for the value function, respectively)? Then again, I may have misunderstood the Proximal Policy Optimization Algorithms paper.
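To make the question concrete, here is a toy sketch of the two setups I have in mind (this is not RLlib code; `theta_pi` and `theta_vf` are hypothetical scalar stand-ins for the two networks’ parameters). It checks, via finite differences, whether the gradient of the summed loss with respect to the policy parameter matches the gradient of the policy loss alone when the parameters are disjoint:

```python
# Hypothetical scalar "losses" standing in for the policy and value losses.
def policy_loss(theta_pi):
    return (theta_pi - 2.0) ** 2        # depends only on the policy parameter

def vf_loss(theta_vf):
    return (theta_vf + 1.0) ** 2        # depends only on the value parameter

def total_loss(theta_pi, theta_vf, vf_loss_coeff=1.0):
    # Single summed loss, analogous to what ppo_surrogate_loss returns.
    return policy_loss(theta_pi) + vf_loss_coeff * vf_loss(theta_vf)

def grad(f, x, eps=1e-6):
    # Central finite-difference approximation of df/dx.
    return (f(x + eps) - f(x - eps)) / (2 * eps)

theta_pi, theta_vf = 0.5, 0.3
g_total_wrt_pi = grad(lambda t: total_loss(t, theta_vf), theta_pi)
g_pi_alone = grad(policy_loss, theta_pi)
print(abs(g_total_wrt_pi - g_pi_alone) < 1e-4)
```

In this toy case the two gradients coincide, since the value term is a constant with respect to the policy parameter; my question is whether that is the intended justification for using one summed loss even when `vf_share_layers` is false.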