How to turn training off for hidden layers of default PPO network?

I want to attempt transfer learning from an old PPO task to a new PPO task. The only difference between the two envs is a slight modification of the reward function. I am trying to follow these steps:

  1. Train an environment with the first reward function. Use trainer.save() to get the checkpoint_path for the policy.

  2. Instantiate a trainer for the same environment with the second reward function (everything else in the environment stays the same). Load the saved checkpoint.
    agent = ppo.PPOTrainer(config=config, env=env_class)
    agent.restore(checkpoint_path)

  3. Turn training off for every policy network layer except the last one. (I want to update the weights of only the last layer of the policy network.)

  4. Start training on the ‘environment with second reward function’ by iterating over trainable.train() calls.

I am stuck at the 3rd step because I couldn’t find any way to turn layer training off using the RLlib APIs. I looked at posts about it but the solution was not clear to me (Partial freeze and partial train, How to do transfer learning with Tune or rllib api? · Issue #5620 · ray-project/ray · GitHub). Here are the two approaches I found.

  1. Using a custom neural network model whose forward function makes the PPO network’s hidden layers non-trainable. I have seen examples of custom models (e.g. ray/rllib/examples/models at master · ray-project/ray · GitHub) but I am not clear on the following parts: Where is the default model of the PPO neural network, and how do I access the layers of that default model in my custom model?

  2. Modify the PPO gradient updates so that the gradients for internal layers become 0. ray/ppo_tf_policy.py at master · ray-project/ray · GitHub. How do I get started with such a modification without modifying the RLlib source code?

Could you help me find the recommended approach, and the answers to the questions associated with it?
cc:
@sven1977 @mannyv @arturn @yiwc @RickLan @rusu24edward

Hi @Saurabh_Arora,

I will try to answer the second part of your question.

  1. Modify the PPO gradient updates so that the gradients for internal layers become 0. ray/ppo_tf_policy.py at master · ray-project/ray · GitHub. How do I get started with such a modification without modifying the RLlib source code?

You can specify a custom function for applying your gradients to a policy.
Note that RLlib’s notion of a policy is not in line with the common conception:
Your PPO policy updates your model weights, and you can change how gradients are applied to these weights by extending your PPO policy with an apply_gradients_fn.

The following differs depending on how you train your policy, but before you start your training you should be able to provide a policy for RLlib to train. For example, by default, the PPOTrainer will use a PPO policy of your framework of choice. I’ll assume you use TF here for simplicity; the process is the same for Torch:

from ray.rllib.agents.ppo.ppo_tf_policy import PPOTFPolicy

def saurabh_aroras_update_function(policy, optimizer, grads_and_vars):
   (...)

CustomPolicy = PPOTFPolicy.with_updates(apply_gradients_fn=saurabh_aroras_update_function)

I have not run this code and it’s more or less off the top of my head.
You can access your model through the policy object in your update function and apply the gradients there to your liking - to the full model or only to parts of it.
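To make the idea concrete, here is a framework-agnostic sketch of the masking logic such an update function could use. The layer names (`fc_1`, `fc_out`) and the scalar gradients are purely illustrative; a real `apply_gradients_fn` would receive TF `(grad, variable)` pairs and finish with `optimizer.apply_gradients(...)`.

```python
def mask_hidden_gradients(grads_and_vars, trainable_prefixes=("fc_out",)):
    """Zero every gradient except those of the listed output layers.

    grads_and_vars: list of (gradient, variable_name) pairs; scalar
    gradients stand in for real tensors in this sketch.
    """
    masked = []
    for grad, name in grads_and_vars:
        keep = any(name.startswith(p) for p in trainable_prefixes)
        masked.append((grad if keep else 0.0, name))
    return masked


# Hidden layers fc_1/fc_2 get zeroed; the output layer keeps its gradient.
grads = [(0.5, "fc_1/kernel"), (0.2, "fc_2/kernel"), (0.9, "fc_out/kernel")]
masked = mask_hidden_gradients(grads)
# masked == [(0.0, "fc_1/kernel"), (0.0, "fc_2/kernel"), (0.9, "fc_out/kernel")]
```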


According to the ‘Extending Existing Policies’ section in How To Customize Policies — Ray 2.0.0.dev0,
one can do

from ray.rllib.agents.ppo import PPOTrainer
from ray.rllib.agents.ppo.ppo_tf_policy import PPOTFPolicy

CustomPolicy = PPOTFPolicy.with_updates(
    name="MyCustomPPOTFPolicy",
    apply_gradients_fn=saurabh_aroras_update_function)

CustomTrainer = PPOTrainer.with_updates(
    default_policy=CustomPolicy)

I am trying to find answers to the following:

  1. How do I find out whether I am using the TF policy or the PyTorch policy?
  2. Where are the documentation and an example of
    update_function(policy, optimizer, grads_and_vars)?

Hi @Saurabh_Arora,

  1. How do I find out whether I am using the TF policy or the PyTorch policy?

You can decide for yourself. The PPOTrainer does not “care” which one you use, as they both implement the same Policy interface. Normally, you would choose via config["framework"] = "torch" | "tf". So to find out which policy you are using, you can either inspect the trainer you create or have a look at your config. If you don’t set config["framework"], the default framework will be “tf” and thus your policy will be PPOTFPolicy.
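As a tiny illustration of that default (plain dict logic, no RLlib import needed; the config keys match RLlib’s, but the lookup here is just a sketch):

```python
# RLlib falls back to "tf" when no framework is given in the config.
config = {"env": "CartPole-v0"}            # "framework" not set
framework = config.get("framework", "tf")  # -> "tf", so PPOTFPolicy
assert framework == "tf"

config["framework"] = "torch"              # opt into PyTorch explicitly
assert config.get("framework", "tf") == "torch"  # -> PPOTorchPolicy
```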

  1. Where are the documentation and an example of
    update_function(policy, optimizer, grads_and_vars)?

I don’t think there is any (but might be wrong). You can still get the job done by having a look at how SAC uses an apply_gradients_fn.

Thanks for response @arturn

The structure of SAC is very different from PPO, so the gradient computation and application are also different. I could not find any apply_gradients function in ppo_tf_policy.py;
there is a compute_and_clip_gradients method.

Is there some way I can make the gradient values zero there for all hidden layers?

@mannyv @sven1977 , any thoughts on my original question?

I solved this issue by finding, replicating, and modifying the code for the default model used by PPO. As per Models, Preprocessors, and Action Distributions — Ray 1.11.0, RLlib will pick a default model based on simple heuristics:

  • A vision network (TF or Torch) for observations that have a shape of length larger than 2, for example, (84 x 84 x 3).
  • A fully connected network (TF or Torch) for everything else.

My observation shape is X by 1, so the length of the shape is 2. Thus PPO should pick the second network, ray/fcnet.py at master · ray-project/ray · GitHub, class FullyConnectedNetwork(TFModelV2).
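That heuristic boils down to a single length check on the observation shape. A sketch (the (10, 1) shape is a stand-in for the actual X-by-1 observation):

```python
obs_shape = (10, 1)                  # an "X by 1" observation
use_vision_net = len(obs_shape) > 2  # vision net only for shapes like (84, 84, 3)
assert not use_vision_net            # -> FullyConnectedNetwork is chosen

assert len((84, 84, 3)) > 2          # image-like obs would get the vision net
```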

The easiest way to make this custom model with non-trainable hidden layers is to copy the whole code from class FullyConnectedNetwork(TFModelV2) into a custom model definition. This model has a class inheritance structure like MyKerasModel in ray/custom_keras_model.py at master · ray-project/ray · GitHub. Then set trainable to False for all ‘fc_’- and ‘fc_value_’-named layers. Example:

last_layer = tf.keras.layers.Dense(
    size,
    name="fc_{}".format(i),
    activation=activation,
    kernel_initializer=normc_initializer(1.0),
    trainable=False
)(last_layer)

Counting the parameters in the output of policy.model.base_model.summary() verifies which layers are set non-trainable.
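A rough sketch of that verification, using hypothetical layer rows as they might be read off the summary() output (the layer sizes and names here are invented for illustration):

```python
def count_params(layers):
    """Sum (trainable, total) parameter counts from (name, n_params, trainable) rows."""
    trainable = sum(n for _, n, is_trainable in layers if is_trainable)
    total = sum(n for _, n, _ in layers)
    return trainable, total


# Hypothetical 2x256 fcnet on a 4-dim observation with frozen hidden layers:
layers = [
    ("fc_1",   4 * 256 + 256,   False),  # frozen
    ("fc_2",   256 * 256 + 256, False),  # frozen
    ("fc_out", 256 * 2 + 2,     True),   # still trainable
]
trainable, total = count_params(layers)
# Only fc_out's 514 parameters should show up as trainable.
```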

To use this custom transfer-learning model, I follow the logic from Models, Preprocessors, and Action Distributions — Ray 1.11.0:

ModelCatalog.register_custom_model("my_torch_model", CustomTorchModel)
trainer = ppo.PPOTrainer(env="CartPole-v0", config={
    "framework": "torch",
    "model": {
        "custom_model": "my_torch_model",
        # Extra kwargs to be passed to your model's c'tor.
        "custom_model_config": {},
    },
})

cc: @arturn
