I want to attempt transfer learning from an old PPO task to a new PPO task. The only difference between the two envs is a slight modification of the reward function. I am trying to follow these steps:
1. Train an environment with the first reward function. Use trainer.save() to get a checkpoint_path for the policy.
2. Instantiate a trainer for the same environment with the second reward function (everything else in the environment stays the same). Load the saved checkpoint:

from ray.rllib.agents import ppo

agent = ppo.PPOTrainer(config=config, env=env_class)
agent.restore(checkpoint_path)

3. Turn training off for every policy network layer except the last one. (I want to update the weights of only the last layer of the policy network.)
4. Start training on the environment with the second reward function by repeatedly calling agent.train().
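Putting these steps together, roughly (an untested sketch; ToyEnv is a hypothetical stand-in for the real environment, with the reward variant selected through env_config):

import gym
import numpy as np
import ray
from ray.rllib.agents import ppo


class ToyEnv(gym.Env):
    """Hypothetical environment; both tasks share dynamics, only the reward differs."""

    def __init__(self, env_config):
        self.observation_space = gym.spaces.Box(-1.0, 1.0, shape=(4,), dtype=np.float32)
        self.action_space = gym.spaces.Discrete(2)
        self.reward_scale = env_config.get("reward_scale", 1.0)
        self._t = 0

    def reset(self):
        self._t = 0
        return self.observation_space.sample()

    def step(self, action):
        self._t += 1
        return self.observation_space.sample(), self.reward_scale, self._t >= 20, {}


ray.init()

# Step 1: train on the first reward function and save a checkpoint.
agent_v1 = ppo.PPOTrainer(
    config={"framework": "tf", "env_config": {"reward_scale": 1.0}}, env=ToyEnv)
agent_v1.train()
checkpoint_path = agent_v1.save()

# Steps 2-4: new trainer for the modified reward, restore the weights, keep training.
agent_v2 = ppo.PPOTrainer(
    config={"framework": "tf", "env_config": {"reward_scale": 0.5}}, env=ToyEnv)
agent_v2.restore(checkpoint_path)
for _ in range(5):
    result = agent_v2.train()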
One option is to use a custom neural network model that marks the PPO network's hidden layers as non-trainable. I have seen examples of custom models (e.g. ray/rllib/examples/models at master · ray-project/ray · GitHub), but I am not clear on the following: where is the default model of the PPO neural network defined, and how can I access the layers of that default model in my custom model?
You can specify a custom function for applying your gradients to a policy.
Note that RLlib's notion of a policy is not in line with the common conception:
Your PPO policy updates your model weights, and you can change how the gradients are applied by updating your PPO policy with an apply_gradients_fn.
The following differs depending on how you train your policy, but before you start your training you should be able to provide a policy for RLlib to train. For example, by default, the PPO Trainer will use a PPO policy of your framework of choice. I'll assume you use TF here for simplicity; the process is the same for torch:
from ray.rllib.agents.ppo.ppo_tf_policy import PPOTFPolicy

def saurabh_aroras_update_function(policy, optimizer, grads_and_vars):
    (...)

# with_updates() returns a new policy class, so keep a reference to it:
MyPPOTFPolicy = PPOTFPolicy.with_updates(
    apply_gradients_fn=saurabh_aroras_update_function)
I have not run this code and it's more or less off the top of my head.
You can access your model through the policy object in your update function and apply the gradients there to your liking - to the full model or only to parts of it.
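As a rough, untested sketch of how that could look: the "fc_out"/"value_out" name filter below assumes the output layer names of RLlib's default fully connected TF model, and the trainer wiring via with_updates is just one possible way to plug the policy in; adapt both to your setup.

from ray.rllib.agents import ppo
from ray.rllib.agents.ppo.ppo_tf_policy import PPOTFPolicy

def saurabh_aroras_update_function(policy, optimizer, grads_and_vars):
    # Drop the gradients of all hidden layers and only apply those belonging
    # to the policy/value output layers, which effectively freezes the rest.
    last_layer_grads_and_vars = [
        (g, v) for g, v in grads_and_vars
        if g is not None and ("fc_out" in v.name or "value_out" in v.name)
    ]
    return optimizer.apply_gradients(last_layer_grads_and_vars)

MyPPOTFPolicy = PPOTFPolicy.with_updates(
    apply_gradients_fn=saurabh_aroras_update_function)

# Have the PPO trainer use the modified policy class (TF case):
MyPPOTrainer = ppo.PPOTrainer.with_updates(
    default_policy=MyPPOTFPolicy,
    get_policy_class=lambda config: MyPPOTFPolicy)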
How do I find out whether I am using a TF policy or a PyTorch policy?
You can decide for yourself. The PPOTrainer does not "care" which one you use, as they both implement the same Policy interface. Normally, you would choose via config["framework"] = "torch" | "tf". So to find out which policy you are using you can either inspect the trainer you create or have a look at your config. If you don't set config["framework"], the standard framework will be "tf" and thus your policy will be PPOTFPolicy.
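For example, a quick (untested) way to check with a built-in Gym env:

import ray
from ray.rllib.agents import ppo

ray.init()
config = ppo.DEFAULT_CONFIG.copy()
config["framework"] = "tf"  # or "torch"

trainer = ppo.PPOTrainer(config=config, env="CartPole-v0")
# Inspect which policy class was actually instantiated:
print(type(trainer.get_policy()))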
Where is the documentation and an example of
update_function(policy, optimizer, grads_and_vars)?
I don't think there is any (but I might be wrong). You can still get the job done by having a look at how SAC uses an apply_gradients_fn.
The structure of SAC is very different from PPO, so gradient computation and application also differ. I could not find an apply_gradients function in ppo_tf_policy.py; there is a compute_and_clip_gradients method.
Is there some way I can make the gradient values zero here for all hidden layers?
The easiest way to make a custom model with non-trainable hidden layers is to copy the whole code of class FullyConnectedNetwork(TFModelV2) to define a custom model. This model has a class inheritance structure like MyKerasModel in https://github.com/ray-project/ray/blob/master/rllib/examples/custom_keras_model.py . Then set trainable=False for all layers named "fc_..." and "fc_value_...". Example:
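(A rough, untested sketch; the layer sizes and names are placeholders, and for simplicity both output heads share the frozen hidden layers, unlike the default model's separate value branch.)

import tensorflow as tf
from ray.rllib.models import ModelCatalog
from ray.rllib.models.tf.tf_modelv2 import TFModelV2


class FrozenHiddenLayersModel(TFModelV2):
    """Simplified FC model whose hidden layers are built with trainable=False."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        super().__init__(obs_space, action_space, num_outputs, model_config, name)
        inputs = tf.keras.layers.Input(shape=obs_space.shape, name="observations")
        # Frozen hidden layers: trainable=False keeps them out of the trainable vars.
        fc_1 = tf.keras.layers.Dense(
            256, activation="tanh", name="fc_1", trainable=False)(inputs)
        fc_2 = tf.keras.layers.Dense(
            256, activation="tanh", name="fc_2", trainable=False)(fc_1)
        # Only the policy and value output heads stay trainable.
        logits = tf.keras.layers.Dense(num_outputs, name="fc_out")(fc_2)
        value = tf.keras.layers.Dense(1, name="value_out")(fc_2)
        self.base_model = tf.keras.Model(inputs, [logits, value])
        self.register_variables(self.base_model.variables)

    def forward(self, input_dict, state, seq_lens):
        logits, self._value_out = self.base_model(input_dict["obs"])
        return logits, state

    def value_function(self):
        return tf.reshape(self._value_out, [-1])


ModelCatalog.register_custom_model("frozen_hidden_fcnet", FrozenHiddenLayersModel)
# Then select it in the trainer config:
# config["model"] = {"custom_model": "frozen_hidden_fcnet"}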