MAML finetune adaptation step for inference

@sven1977

While doing inference with a MAML-based policy, how does the fine-tune adaptation step happen for a new meta-test task? How does the MAML global gradient perform the 1-step gradient update to fine-tune the weights to the new meta-test task?

Also, how many steps does the agent need to sample in the meta-test environment to perform the fine-tune gradient update step? Will this be equal to the rollout fragment length?

Re-routing this question to @michaelzhiluo, who implemented this algo. :slight_smile:


While doing inference with a MAML-based policy, how does the fine-tune adaptation step happen for a new meta-test task?

For inference, the policy just starts from the meta-learned prior (the weights after MAML training) and does standard RL training, which is the “adaptation step”. The agent is supposed to adapt quickly to the environment, since the whole point of meta-learning is to teach the agent to adapt quickly to new environments.

How does the MAML global gradient perform the 1-step gradient update to fine-tune the weights to the new meta-test task?

There is a difference between MAML training and MAML testing. Fine-tuning the weights has nothing to do with MAML’s global (meta) gradient.

MAML training is covered by RLlib with our MAML agent. This is where we compute the meta-gradient.

For MAML testing, you can take the weights from an agent trained with the MAML agent and adapt them to the test task with another round of training (I recommend PPOTrainer, since we implement MAML-PPO, where PPO is the inner update step).
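In case it helps, here is a minimal sketch of that flow, assuming the `ray.rllib.agents` API; the checkpoint path, env names, and iteration count are placeholders, and the model configs of the two trainers are assumed to match:

```python
import ray
from ray.rllib.agents.maml import MAMLTrainer
from ray.rllib.agents.ppo import PPOTrainer

ray.init()

# Restore the meta-trained MAML agent (checkpoint path is a placeholder).
maml_trainer = MAMLTrainer(env="YourMetaEnv-v0", config={"framework": "torch"})
maml_trainer.restore("/path/to/maml/checkpoint")

# Build a PPO trainer on the new meta-test task and copy the meta-learned
# weights into it, so fine-tuning starts from the MAML prior.
ppo_trainer = PPOTrainer(env="YourTestTask-v0", config={"framework": "torch"})
ppo_trainer.set_weights(maml_trainer.get_weights())

# Test-time "adaptation" is then just standard RL training from that prior.
for i in range(10):  # 10+ adaptation steps / RL iterations (see below)
    result = ppo_trainer.train()
    print(i, result["episode_reward_mean"])
```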

Also, how many steps does the agent need to sample in the meta-test environment to perform the fine-tune gradient update step? Will this be equal to the rollout fragment length?

The number of steps sampled in the environment depends on your environment horizon (1,000 timesteps for HalfCheetah, for example) times how many episodes you want to collect per adaptation step for the meta-learned agent. If you want your agent to fully adapt to a new test environment, I recommend 10+ adaptation steps/RL iterations.

It will be equal to the rollout fragment length if you set batch_mode to complete_episodes in the config.
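As a rough illustration of how those settings interact (the numbers are placeholders, not a recommendation):

```python
config = {
    "horizon": 1000,                    # env episode length (e.g. HalfCheetah)
    "batch_mode": "complete_episodes",  # sample whole episodes, never truncate
    # With complete_episodes, a sampled fragment is rounded up to whole
    # episodes, so one fragment here is exactly one 1000-step episode.
    "rollout_fragment_length": 1000,
    # 10 episodes' worth of timesteps per training iteration / adaptation step.
    "train_batch_size": 10 * 1000,
}
```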


Thanks for your detailed response; it is very helpful and cleared up a lot of my doubts.

So if I understood this correctly, I should not use the MAML-trained checkpoint directly for inference on a new task. Instead, I should take the MAML-trained checkpoint, use it with PPOTrainer to do further adaptation training, and train the PPOTrainer for around 10+ RL iterations so that my agent is well adapted to the new task. Is this understanding correct?

I have set the config to complete_episodes; the validation check does not allow the truncate_episodes mode for MAML. Each of my episodes is 4,200 timesteps long. So for my PPOTrainer to adapt, should I train it for 10 episodes, i.e. 42,000 timesteps?

Will PPOTrainer be able to load the MAML-trained checkpoint easily using RLlib’s API (agent.restore())? Or do I need to manually load the TensorFlow/Torch model weights and biases? Is there a code reference for the meta-test and PPOTrainer-based adaptation steps that I can refer to?

Another question: I am using MAML for sim2real transfer. When I transfer my model from simulation to reality, will I also have to train the model on the real-world MDP for 10+ adaptation steps?

I was under the impression that once I deploy the fully (simulator-)trained MAML agent in reality, it would be able to adapt automatically to the real-world MDP. Is this assumption incorrect? Do I always have to do the adaptation steps after real-world deployment when doing sim2real transfer with meta-RL? If you could share any references/papers that describe the process of real-world deployment of MAML/meta-RL for sim2real transfer, that would be incredibly helpful.

Thanks in advance.

Yes, I think this is the correct understanding.

It depends on your training batch size for PPO. You could collect 10 episodes per PPO iteration, so 10 iterations would be 100 episodes.
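Making that arithmetic concrete for the 4,200-step episodes mentioned above (just a sketch with illustrative numbers):

```python
episode_len = 4200        # timesteps per episode in your environment
episodes_per_iter = 10    # episodes collected per PPO training iteration
adaptation_iters = 10     # recommended minimum number of adaptation steps

train_batch_size = episode_len * episodes_per_iter     # 42,000 timesteps/iteration
total_timesteps = train_batch_size * adaptation_iters  # 420,000 timesteps total
total_episodes = episodes_per_iter * adaptation_iters  # 100 episodes total

ppo_config = {
    "batch_mode": "complete_episodes",
    "horizon": episode_len,
    "train_batch_size": train_batch_size,
}
```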

I think it should work; I have never tried this specific setup before, but I have seen model-weight transfer between different agents work.
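If restoring a MAML checkpoint directly into a PPOTrainer does not work, a fallback is to copy the weights manually (assuming the default single-policy setup, identical model configs, and `maml_trainer`/`ppo_trainer` as in the sketch above):

```python
# Restore the MAML checkpoint into a MAMLTrainer, then copy its policy
# weights into the PPO trainer instead of restoring the checkpoint directly.
maml_weights = maml_trainer.get_policy().get_weights()
ppo_trainer.get_policy().set_weights(maml_weights)

# Or transfer all policies at once at the trainer level:
ppo_trainer.set_weights(maml_trainer.get_weights())
```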
