Transfer Learning for Multi-Agent Environments with RLlib

I am having a problem with RLlib:
I trained a network and it achieved good results.
When restoring the last checkpoint, everything works fine. However, when I initialize a new trainer (configured like the trained one) and set its weights equal to the trained one's, I do not get good results.



preTrained_trainer = PPOTrainer(config=config_trained, env=config_trained["env"])
# Restore all policies from checkpoint.
preTrained_trainer.restore(config_checkpoint)
# Get trained weights for all policies.
trained_weights = preTrained_trainer.get_weights()

new_trainer = PPOTrainer(config=config_trained, env=config_trained["env"])
# Set the trained weights on the new trainer.
new_trainer.set_weights(trained_weights)

PS: I thought of copying the filters by doing this:

# Copy the filters; policy_frozen lists all the trained policies.
for policy_name in policy_frozen:
    new_trainer.workers.local_worker().filters[policy_name] = preTrained_trainer.workers.local_worker().filters[policy_name]

However, I still have bad results.

Did I miss something? Should I set something else in addition to the weights in order to get the same trainer?
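For reference, this is the kind of custom restore helper I have in mind. It is only a sketch: the helper name is mine, and I am assuming that `Trainer.set_weights()` only updates the local worker, so the weights also have to be pushed to the remote rollout workers (the `foreach_worker()` call on the `WorkerSet` is an assumption about the RLlib version):

```python
def sync_weights_and_filters(src_trainer, dst_trainer, policy_ids):
    """Copy trained weights and observation filters from src_trainer to
    dst_trainer, then push the weights to every rollout worker of
    dst_trainer (hypothetical helper; assumed RLlib Trainer/WorkerSet API)."""
    weights = src_trainer.get_weights(policy_ids)
    dst_trainer.set_weights(weights)

    # Observation filters (e.g. MeanStdFilter) are stateful: copy them too,
    # using .copy() so the two trainers do not share the same filter object.
    src_filters = src_trainer.workers.local_worker().filters
    dst_filters = dst_trainer.workers.local_worker().filters
    for pid in policy_ids:
        dst_filters[pid] = src_filters[pid].copy()

    # set_weights() on the trainer may only touch the local worker, so
    # broadcast the weights to all rollout workers as well.
    dst_trainer.workers.foreach_worker(lambda w: w.set_weights(weights))
```

If the remote workers keep their freshly initialized weights or filters, rollouts driven by them would look untrained even though the local worker is correct.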

Hi @wzaielamri,

Do you have a simple but complete reproduction script you could provide?

Here is the whole code. It rolls out an episode of my MuJoCo environment (Ant-Agent).
The question is not directly tied to the code provided: when copying the weights of the 4 policies into “new_trainer”, the new trainer does not achieve the same results as “preTrained_trainer”.

So the question is: what could be wrong? Is it enough to copy the weights? Are the filters also important to copy? And is there anything else that should be copied in order to get the same performance?

PS: I know it is possible to restore the PPO trainer directly with the restore function. However, I want to initialize my own trainer from another one for later use; in other words, I want to have my own customized restore function.

import ray
import pickle5 as pickle
import os
import gym
import numpy as np
from ray.tune.registry import get_trainable_cls
from ray.rllib.evaluation.worker_set import WorkerSet
from maze_envs.quantruped_centralizedController_environment import Quantruped_Centralized_Env
from ray.rllib.agents.ppo import PPOTrainer

from evaluation.rollout_episodes import rollout_episodes

"""
    Visualizing a learned (multiagent) controller,
    for evaluation or visualisation.
    
    This is adapted from rllib's rollout.py
    (github.com/ray/rllib/rollout.py)
"""

# Setting number of steps and episodes
num_steps = int(600)
num_episodes = int(1)

ray.init()

smoothness = 1

# Selecting checkpoint to load
config_checkpoints = [
    './ray_results/2_2_0_QuantrupedMultiEnv/PPO_QuantrupedMultiEnv_2c71b_00004_4_2022-02-14_18-11-40/checkpoint_002500/checkpoint-2500',
]

for config_checkpoint in config_checkpoints:
    config_dir = os.path.dirname(config_checkpoint)
    config_path = os.path.join(config_dir, "params.pkl")

    # Loading configuration for checkpoint.
    if not os.path.exists(config_path):
        config_path = os.path.join(config_dir, "../params.pkl")

    if os.path.exists(config_path):
        with open(config_path, "rb") as f:
            config_trained = pickle.load(f)


    # Adjusting the loaded configuration.
    if "num_workers" in config_trained:
        config_trained["num_workers"] = min(1, config_trained["num_workers"])
    cls = get_trainable_cls('PPO')
    # Setting config values (required for compatibility between versions)
    config_trained["create_env_on_driver"] = True
    config_trained['env_config']['hf_smoothness'] = smoothness
    if "no_eager_on_workers" in config_trained:
        del config_trained["no_eager_on_workers"]


    config_trained['num_envs_per_worker'] = 1  # 4

    preTrained_trainer = PPOTrainer(config=config_trained, env=config_trained["env"])
    # Restore all policies from checkpoint.
    preTrained_trainer.restore(config_checkpoint)
    # Get trained weights for all policies.
    trained_weights = preTrained_trainer.get_weights()

    new_trainer = PPOTrainer(config=config_trained, env=config_trained["env"])
    # Set the trained weights on the new trainer.
    new_trainer.set_weights(trained_weights)

    policy_frozen = ["Agent_0_policy", "Agent_1_policy", "Agent_2_policy", "Agent_3_policy"]
    # Copy the filters; policy_frozen lists all the trained policies.
    for policy_name in policy_frozen:
        new_trainer.workers.local_worker().filters[policy_name] = preTrained_trainer.workers.local_worker().filters[policy_name]


    # Retrieve environment for the trained agent.
    if hasattr(new_trainer, "workers") and isinstance(new_trainer.workers, WorkerSet):
        env = new_trainer.workers.local_worker().env
        
    save_image_dir = "./videos/" + \
        config_checkpoint.split("/")[-4]


    # Rolling out simulation = stepping through simulation.
    reward_eps, steps_eps, dist_eps, power_total_eps, vel_eps, cot_eps = rollout_episodes(env, new_trainer, num_episodes=num_episodes,
                                                                                          num_steps=num_steps, render=True, camera_name="side_fixed", plot=False, save_images=save_image_dir+"/img_")

    new_trainer.stop()

@mannyv

So what are the parts that I am missing to get the same results as when using the restore function?
As I understand from the code of the restore function on GitHub:

  • Weights are being restored: important!
  • Filter values are being restored: important!
  • Trainer state (e.g., checkpoint number, episode counters, etc.): not important to copy, correct?

So by copying the weights and filters, everything should be fine.
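To double-check that the weight copy actually took effect, here is a quick sanity check I use (a debugging sketch of my own; it assumes `get_weights()` returns a nested dict of policy id to named parameter arrays, and that NumPy is available):

```python
import numpy as np

def weights_match(trainer_a, trainer_b):
    """Return True iff both trainers hold numerically identical weights
    for every policy (assumes get_weights() -> {policy_id: {name: array}})."""
    wa, wb = trainer_a.get_weights(), trainer_b.get_weights()
    if wa.keys() != wb.keys():
        return False
    for pid in wa:
        if wa[pid].keys() != wb[pid].keys():
            return False
        for name in wa[pid]:
            if not np.array_equal(np.asarray(wa[pid][name]),
                                  np.asarray(wb[pid][name])):
                return False
    return True
```

If this returns True but rollouts still differ, the gap is in something other than the weights (e.g. the observation filters or the remote workers).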

PS: Just to clarify why I need this: later on I want to restore only specific policies and replace the others with new ones that have different observation spaces, etc. (transfer learning on specific policies). For now I am restoring everything manually to test that the restore logic works, before proceeding to the next step.
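For that later step, restoring only a subset of policies could look roughly like this (a sketch with a hypothetical helper name; `policies_to_restore` would be e.g. `["Agent_0_policy"]`, matching the policy ids from my script):

```python
def restore_selected_policies(src_trainer, dst_trainer, policies_to_restore):
    """Copy weights and filters only for the named policies; every other
    policy in dst_trainer keeps its fresh initialization
    (hypothetical helper; assumed RLlib Trainer API)."""
    all_weights = src_trainer.get_weights()
    selected = {pid: w for pid, w in all_weights.items()
                if pid in policies_to_restore}
    dst_trainer.set_weights(selected)

    # Copy the matching observation filters as well.
    src_filters = src_trainer.workers.local_worker().filters
    dst_filters = dst_trainer.workers.local_worker().filters
    for pid in policies_to_restore:
        dst_filters[pid] = src_filters[pid].copy()
    return selected
```

The policies that are replaced by new ones (with a different observation space) would simply be left out of `policies_to_restore`.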