Understanding the impact of vf_share_layers on the loss function calculation

I’m a little bit confused about this, and perhaps it is my understanding of the theory that isn’t right.

When I read the implementation of

def ppo_surrogate_loss(
        policy: Policy, model: ModelV2, dist_class: Type[TFActionDistribution],
        train_batch: SampleBatch) -> Union[TensorType, List[TensorType]]:
    # ...
    if policy.config["use_gae"]:
        # ...
        total_loss = reduce_mean_valid(
            -surrogate_loss + policy.kl_coeff * action_kl +
            policy.config["vf_loss_coeff"] * vf_loss -
            policy.entropy_coeff * curr_entropy)
    # ...
    
    # Store stats in policy for stats_fn.
    policy._total_loss = total_loss
    policy._mean_policy_loss = mean_policy_loss
    policy._mean_vf_loss = mean_vf_loss
    policy._mean_entropy = mean_entropy
    policy._mean_kl = mean_kl

    return total_loss

I am surprised to see that the computation of the loss function is independent of vf_share_layers. If the policy and value networks are fully independent, shouldn’t each be trained with its own loss (i.e., -surrogate_loss + policy.kl_coeff * action_kl and vf_loss, respectively)? Again, I may have misunderstood the Proximal Policy Optimization Algorithms paper.
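For context, my reading of the paper is that the combined objective (its equation (9), where $c_1$ and $c_2$ correspond to vf_loss_coeff and entropy_coeff above) is only introduced for architectures that share parameters between the policy and value function:

$$
L_t^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t\!\left[ L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2 S[\pi_\theta](s_t) \right]
$$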

@sven1977 could you chime in here?

I have a related question for @sven1977 that came up as I was looking through the LSTM wrapper code. When LSTM auto-wrapping is used, I don’t think the vf_share_layers option is considered; the layers seem to always be shared. Is that true?

As for @GattiPinheiro’s question, I think the existing code works fine because the computation graph will partition the gradients to the correct layers.

How can the computation graph split the loss correctly? How can it know how much of the total loss (a scalar) is due to the policy network and how much to the value network? I think the existing code works because the loss function isn’t wrong (you still want to minimize it).

It can split it correctly because the loss is not just a “scalar”. It is a tf/torch tensor with “requires_grad=True”. That “scalar” records all of the operations and the inputs to those operations, so that automatic differentiation can be applied when backward is called to calculate the gradients. In this example, since you have independent networks and loss = surrogate_loss + vf_loss, the derivatives of the surrogate loss term w.r.t. the vf network are 0, and vice versa for vf_loss and the policy network.
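Here is a minimal standalone sketch of what I mean (plain PyTorch, not RLlib code, with toy stand-ins for the two loss terms):

```python
import torch
import torch.nn as nn

# Two completely independent networks, analogous to separate policy and value nets.
policy_net = nn.Linear(4, 2)
value_net = nn.Linear(4, 1)

obs = torch.randn(8, 4)

# Toy stand-ins for the two loss terms; their sum is a single scalar tensor,
# but it still records which operations (and thus which parameters) produced it.
policy_loss = policy_net(obs).mean()
vf_loss = (value_net(obs) ** 2).mean()
total_loss = policy_loss + vf_loss

total_loss.backward()

# The vf term contributes zero gradient to the policy weights and vice versa,
# so optimizing the joint loss updates each network exactly as its own loss would.
print(policy_net.weight.grad)  # depends only on policy_loss
print(value_net.weight.grad)   # depends only on vf_loss
```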

I suggest you have a look here for more information: PyTorch Basics: Understanding Autograd and Computation Graphs


Hey @GattiPinheiro and @mannyv, great questions. Having two completely separate networks would not prevent the joint loss (sum of policy loss + vf loss) from updating both of them. What @mannyv said is correct: the loss backprops through both networks and calculates gradients for all weight matrices involved (those of the vf network and those of the policy network).
It’s also true that LSTM auto-wrapping currently ignores a possible vf_share_layers=False setting. As I commented in the other topic, this is not so trivial to fix and will require some changes to the Model API (which we are targeting for Q2 anyway).