How to use own optimizer for custom_loss_model example

When we create a custom model as in the custom_loss_model example, the custom_loss function can return a list that is one entry longer than the incoming policy_loss list (so that the custom loss gets its own optimizer), as follows.

    def custom_loss(self, policy_loss, loss_inputs):
        """Calculates a custom loss on top of the given policy_loss(es).

        Args:
            policy_loss (List[TensorType]): The list of already calculated
                policy losses (as many as there are optimizers).
            loss_inputs (TensorStruct): Struct of np.ndarrays holding the
                entire train batch.

        Returns:
            List[TensorType]: The altered list of policy losses. In case the
                custom loss should have its own optimizer, make sure the
                returned list is one larger than the incoming policy_loss list.
                In case you simply want to mix in the custom loss into the
                already calculated policy losses, return a list of altered
                policy losses (as done in this example below).
        """
        # Get the next batch from our input files.
        batch = self.reader.next()

        # Define a secondary loss by building a graph copy with weight sharing.
        obs = restore_original_dimensions(
            torch.from_numpy(batch["obs"]).float().to(policy_loss[0].device),
            self.obs_space,
            tensorlib="torch")
        logits, _ = self.forward({"obs": obs}, [], None)

        # You can also add self-supervised losses easily by referencing tensors
        # created during _build_layers_v2(). For example, an autoencoder-style
        # loss can be added as follows:
        # ae_loss = squared_diff(
        #     loss_inputs["obs"], Decoder(self.fcnet.last_layer))
        print("FYI: You can also use these tensors: {}, ".format(loss_inputs))

        # Compute the IL loss.
        action_dist = TorchCategorical(logits, self.model_config)
        imitation_loss = torch.mean(-action_dist.logp(
            torch.from_numpy(batch["actions"]).to(policy_loss[0].device)))
        self.imitation_loss_metric = imitation_loss.item()
        self.policy_loss_metric = np.mean([l.item() for l in policy_loss])

        # Give the imitation loss its own entry (and thus its own optimizer):
        return policy_loss + [10 * imitation_loss]
        # Alternatively, mix the imitation loss into each already calculated
        # policy loss term:
        # return [loss_ + 10 * imitation_loss for loss_ in policy_loss]

However, the compute_gradients function contains the statement assert len(loss_out) == len(self._optimizers), which then errors, since loss_out now has length 2 while self._optimizers has length 1. Is this a bug?
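To make the mismatch concrete, here is a minimal plain-Python sketch (no RLlib; FakeOptimizer and the loop body are illustrative stand-ins, not RLlib's actual code) of the one-loss-per-optimizer pairing that the assert enforces:

```python
class FakeOptimizer:
    """Illustrative stand-in for a torch.optim optimizer."""
    def __init__(self):
        self.steps = 0

    def step(self):
        self.steps += 1


def compute_gradients(loss_out, optimizers):
    # Mirrors the assert in question: losses and optimizers are zipped
    # one-to-one, so their counts must match.
    assert len(loss_out) == len(optimizers)
    for loss, opt in zip(loss_out, optimizers):
        opt.step()  # in RLlib this would backprop `loss` and step `opt`


policy_loss = [0.5]      # one already-computed policy loss
imitation_loss = 0.2

# Own-optimizer variant: 2 losses but only 1 optimizer -> AssertionError.
try:
    compute_gradients(policy_loss + [10 * imitation_loss], [FakeOptimizer()])
except AssertionError:
    print("assert fires: 2 losses, 1 optimizer")

# Mix-in variant: the list length is unchanged, so the assert passes.
compute_gradients([l + 10 * imitation_loss for l in policy_loss],
                  [FakeOptimizer()])
print("mix-in variant passes")
```

So the own-optimizer return form can only work if the policy also builds a second optimizer; with a single optimizer, only the mix-in form passes the assert.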

It seems that in the custom_loss_model example, the custom model is updated both by the policy gradient and by the self-defined loss (a supervised loss over an offline dataset). Should we instead first train the custom model with the self-defined loss alone, and then either fine-tune it with the policy loss or leave the pretrained model unchanged?
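For concreteness, the two training schedules being compared can be sketched like this (plain Python; the update callables are hypothetical placeholders for one optimization step on each loss):

```python
def train_joint(num_iters, policy_update, supervised_update):
    """What the example currently does: both losses every iteration."""
    for _ in range(num_iters):
        supervised_update()
        policy_update()


def train_pretrain_then_finetune(pretrain_iters, finetune_iters,
                                 policy_update, supervised_update):
    """The alternative asked about: supervised pretraining, then RL only."""
    for _ in range(pretrain_iters):
        supervised_update()
    for _ in range(finetune_iters):
        policy_update()


# Count how often each loss is applied under the joint schedule.
counts = {"policy": 0, "supervised": 0}
train_joint(
    10,
    lambda: counts.__setitem__("policy", counts["policy"] + 1),
    lambda: counts.__setitem__("supervised", counts["supervised"] + 1))
print(counts)  # both updates applied 10 times each
```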

Thanks for any suggestion!