RLlib multi-agent: Load only one policy from a checkpoint & compatibility of RLlib/Tune checkpoints

I am working in a multi-agent setup with 3 agents and I want to use pretrained weights for one (and only one!) of them. In my current workflow, I run my experiments using tune.run() and would prefer to keep it that way. Is it possible to restore only the weights of one agent from a checkpoint in a clean way?

One (somewhat hacky) workaround I tried was calling a function before the tune.run() call that does the following (a rough sketch is shown after the list):

  1. Initialize an RLlib trainer ("trainer1")
  2. Load the checkpoint into trainer1
  3. Get the weights for the pretrained agent via trainer1.get_weights(pretrain_agent)
  4. Initialize another, randomly initialized trainer ("trainer2")
  5. Load the pretrained weights into trainer2
  6. Save trainer2 as a checkpoint → use this checkpoint for tune.run(restore=…)
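
In sketch form (with config, pretrain_agent and pretrained_checkpoint_path standing in for my actual setup, and PPO just as an example algorithm), the workaround looks roughly like this:

from ray.rllib.agents.ppo import PPOTrainer

# config, pretrain_agent and pretrained_checkpoint_path come from my setup.

# 1. + 2. Initialize a trainer and restore the full checkpoint into it.
trainer1 = PPOTrainer(config=config)
trainer1.restore(pretrained_checkpoint_path)

# 3. Pull out only the weights of the pretrained agent's policy.
pretrained_weights = trainer1.get_weights([pretrain_agent])

# 4. + 5. Initialize another, randomly initialized trainer and overwrite
# only that one policy's weights in it.
trainer2 = PPOTrainer(config=config)
trainer2.set_weights(pretrained_weights)

# 6. Save trainer2 and pass the resulting path to tune.run(restore=...).
new_checkpoint = trainer2.save()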

However, this didn't end up working, because I couldn't use the checkpoint generated by RLlib that way in the tune.run() call (metadata missing). Is it possible to do this? That would solve the main problem with the proposed workflow above.

Hey @Rafael_Albert, great question. We should add an example to RLlib on how to do this.
Could you try doing the following and let me know? I’ll also PR this right now.

"""Simple example of how to restore only one of n agents from a trained
multi-agent Trainer using Ray tune.

The trick/workaround is to use an intermediate trainer that loads the
trained checkpoint into all policies and then reverts those policies
that we don't want to restore, then saves a new checkpoint, from which
tune can pick up training.

Control the number of agents and policies via --num-agents and --num-policies.
"""

import argparse
import gym
import os
import random

import ray
from ray import tune
from ray.rllib.agents.ppo import PPOTrainer
from ray.rllib.examples.env.multi_agent import MultiAgentCartPole
from ray.rllib.utils.framework import try_import_tf
from ray.rllib.utils.test_utils import check_learning_achieved

tf1, tf, tfv = try_import_tf()

parser = argparse.ArgumentParser()

parser.add_argument("--num-agents", type=int, default=4)
parser.add_argument("--num-policies", type=int, default=2)
parser.add_argument("--pre-training-iters", type=int, default=5)
parser.add_argument("--stop-iters", type=int, default=200)
parser.add_argument("--stop-reward", type=float, default=150)
parser.add_argument("--stop-timesteps", type=int, default=100000)
parser.add_argument("--num-cpus", type=int, default=0)
parser.add_argument("--as-test", action="store_true")
parser.add_argument(
    "--framework", choices=["tf2", "tf", "tfe", "torch"], default="tf")

if __name__ == "__main__":
    args = parser.parse_args()

    ray.init(num_cpus=args.num_cpus or None)

    # Get observation- and action spaces.
    single_env = gym.make("CartPole-v0")
    obs_space = single_env.observation_space
    act_space = single_env.action_space

    # Setup PPO with an ensemble of `num_policies` different policies.
    policies = {
        f"policy_{i}": (None, obs_space, act_space, {})
        for i in range(args.num_policies)
    }
    policy_ids = list(policies.keys())

    def policy_mapping_fn(agent_id):
        pol_id = random.choice(policy_ids)
        return pol_id

    config = {
        "env": MultiAgentCartPole,
        "env_config": {
            "num_agents": args.num_agents,
        },
        # Use GPUs iff `RLLIB_NUM_GPUS` env var set to > 0.
        "num_gpus": int(os.environ.get("RLLIB_NUM_GPUS", "0")),
        "num_sgd_iter": 10,
        "multiagent": {
            "policies": policies,
            "policy_mapping_fn": policy_mapping_fn,
        },
        "framework": args.framework,
    }

    # Do some training and store the checkpoint.
    results = tune.run(
        "PPO",
        config=config,
        stop={"training_iteration": args.pre_training_iters},
        verbose=1,
        checkpoint_freq=1,
        checkpoint_at_end=True,
    )
    print("Pre-training done.")

    best_checkpoint = results.get_best_checkpoint(
        results.trials[0], mode="max")
    print(f".. best checkpoint was: {best_checkpoint}")

    # Create a new dummy Trainer to "fix" our checkpoint.
    new_trainer = PPOTrainer(config=config)
    # Get untrained weights for all policies.
    untrained_weights = new_trainer.get_weights()
    # Restore all policies from checkpoint.
    new_trainer.restore(best_checkpoint)
    # Set back all weights (except for 1st agent) to original
    # untrained weights.
    new_trainer.set_weights({
        pid: w for pid, w in untrained_weights.items()
        if pid != "policy_0"
    })
    # Create the checkpoint from which tune can pick up the
    # experiment.
    new_checkpoint = new_trainer.save()
    print(".. checkpoint to restore from (all policies reset, "
          f"except policy_0): {new_checkpoint}")

    print("Starting new tune.run")

    # Start our actual experiment.
    stop = {
        "episode_reward_mean": args.stop_reward,
        "timesteps_total": args.stop_timesteps,
        "training_iteration": args.stop_iters,
    }

    # Freeze the restored policy_0; only train the (reset) other policies.
    config["multiagent"]["policies_to_train"] = [
        pid for pid in policy_ids if pid != "policy_0"
    ]

    results = tune.run(
        "PPO",
        stop=stop,
        config=config,
        verbose=1,
        restore=new_checkpoint,
    )

    if args.as_test:
        check_learning_achieved(results, args.stop_reward)
    ray.shutdown()

Here is the PR:

Aha this is what I was after a few days ago!

It's kind of frustrating that you have to start up a trainer class, with all the overhead that entails, just to get the trained weights out. If I want to get trained weights from a checkpoint while all the CPUs are being used by Tune, what can I do?

This is what I ended up doing:

import os
import pickle


def get_policy_weights_from_checkpoint(trainer_class, checkpoint):
    # Load the run's config from the params.pkl that lies next to the checkpoint.
    run_base_dir = os.path.dirname(checkpoint)
    config_path = os.path.join(run_base_dir, 'params.pkl')
    with open(config_path, 'rb') as f:
        config = pickle.load(f)

    # Keep the temporary trainer as small as possible.
    config['num_workers'] = 1
    config['evaluation_num_workers'] = 0

    # "yaniv" is my registered custom env.
    eval_trainer = trainer_class(env="yaniv", config=config)
    eval_trainer.load_checkpoint(checkpoint)
    weights = eval_trainer.get_policy("policy_1").get_weights()
    eval_trainer.stop()

    return weights
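
Usage is then just something like this (PPOTrainer and the checkpoint path are placeholders; note the function looks for params.pkl in the same directory as the checkpoint you pass in):

from ray.rllib.agents.ppo import PPOTrainer

weights = get_policy_weights_from_checkpoint(
    PPOTrainer, "/path/to/checkpoint-100")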

Is there a way to start a trainer without any resources? I know A3C won't let you.

Hey @Rory, I completely agree: one should be able to extract weights (of any policy and any model) from a checkpoint without having to start the entire trainer. What you did with num_workers=1 is a cool hack to at least somewhat alleviate this problem. But yeah, it's currently not possible. I'll add this valid point to our internal planning doc (we would like to go over the train/save/restore API and make it more fine-grained).

No worries :slight_smile: glad to hear it's on the list of things to do! I've ended up just saving the latest checkpoint and then saving the models' weights at intervals using pickle.
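
In case it helps anyone, a rough sketch of that pattern (not my exact code) using RLlib's DefaultCallbacks.on_train_result hook; the output path and the interval are just examples:

import pickle

from ray.rllib.agents.callbacks import DefaultCallbacks


class SaveWeightsCallback(DefaultCallbacks):
    """Pickles all policy weights every few training iterations."""

    def on_train_result(self, *, trainer, result, **kwargs):
        it = result["training_iteration"]
        if it % 10 == 0:  # Example interval.
            # Example output path; adjust to your setup.
            with open(f"/tmp/weights_iter_{it}.pkl", "wb") as f:
                pickle.dump(trainer.get_weights(), f)


# Hook it into the experiment via the config passed to tune.run():
# config["callbacks"] = SaveWeightsCallback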

Cool :slight_smile: Not sure, but this may even be something for Tune: keep the Trainer(s) (more than one if there are multiple trials) alive after a tune.run() and return them somehow within the returned results. Then the user would be responsible for destroying them, though, which normally shouldn't be an issue.

@sven1977 Is it possible to load only one policy from a checkpoint if we are using the PolicyServerInput and PolicyClient architecture instead of Ray Tune?

Not if it’s an RLlib Trainer checkpoint, no.

You'd always have to do the workaround described above: restore the entire trainer, then set each policy's weights to either the initial ones or some weights you have stored before.
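
In sketch form (policy_0, checkpoint_path and serving_trainer are placeholders; serving_trainer being the Trainer you configured with PolicyServerInput):

# Remember the initial (untrained) weights of all policies first ...
initial_weights = serving_trainer.get_weights()

# ... restore everything from the checkpoint ...
serving_trainer.restore(checkpoint_path)

# ... then put the initial weights back for all policies except the one
# we actually want to keep from the checkpoint.
serving_trainer.set_weights({
    pid: w for pid, w in initial_weights.items()
    if pid != "policy_0"
})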

@sven1977
Thank you for your example.

I am facing the following problem:
I am using Ray Tune (and I would like to keep using it, if possible).
I have two policies (A and B) that I already trained. I want to restore one of them (only policy A) and train another policy (C). C is similar to B, but has a different observation space.
Using your example this is not possible, because to restore A, policy C would also have to be included in the checkpoint, and I get an error that the policy shapes do not match (B has different observation shapes than C).

Is there any workaround for this?

PS: I tried to load the weights of the Keras model of policy A in the init function; however, with Tune this does not help and the weights do not get updated.