Updating policy_mapping_fn while using tune.run() and restoring from a checkpoint

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

In my experiments, I wish to pre-train some agents with self-play on an environment (call it env a) before switching to training on a new environment (call it env b) where they play against each other. I therefore wish to restore the pre-trained policies to continue training on the new environment with a new policy_mapping_fn.

While I could achieve this using Algorithm.from_checkpoint() and then calling train(), I appreciate the helper functions of the ray.tune library, such as automated checkpointing and logging. Here is pseudocode of my training process.

config = PPOConfig().environment(env=a).multi_agent(policy_mapping_fn=y).rollouts(num_rollout_workers=0)

trial = tune.run("PPO", config=config, checkpoint_at_end=True)
restore = trial.checkpoint.dir_or_data

config = config.environment(env=b).multi_agent(policy_mapping_fn=z)

tune.run("PPO", config=config, restore=restore)

There seems to have been a change in behaviour between RLlib 2.0 and 2.4, however: in the second call to tune.run(), the policy_mapping_fn passed in the config is no longer used with Ray 2.4. Instead, the one restored from the checkpoint is used.

I have looked into callbacks, but I cannot find a suitable one.
Similarly, while I believe Algorithm.from_checkpoint() would set up the algorithm correctly and allow me to update the policy_mapping_fn, to my knowledge I cannot pass the resulting instance to a tune.run() call.

Is there a workaround where I can keep the behaviour from ray 2.0?

Hello @Muff2n ,

One example that might be inspiring is this one:

It uses the on_train_result hook of the callback to add a new policy and update the policy_mapping_fn.
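Roughly, that pattern looks like the sketch below. This is only a sketch: the policy ids, the "main" policy name, the agent id, and the every-10-iterations trigger are placeholders, not taken from the linked example.

from ray.rllib.algorithms.callbacks import DefaultCallbacks


class AddPolicyCallback(DefaultCallbacks):
    def on_train_result(self, *, algorithm, result, **kwargs):
        # Placeholder trigger: add a new opponent policy every 10 iterations.
        if result["training_iteration"] % 10 == 0:
            new_pid = f"opponent_{result['training_iteration']}"

            def mapping_fn(agent_id, episode, worker, **kw):
                # Placeholder mapping: one agent plays the new policy,
                # everyone else keeps playing "main".
                return new_pid if agent_id == "player_1" else "main"

            algorithm.add_policy(
                policy_id=new_pid,
                policy_cls=type(algorithm.get_policy("main")),
                policy_mapping_fn=mapping_fn,
            )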

If you need to load the checkpoint upon initialization of the callback you can override the on_algorithm_init hook of the callbacks and update the algorithm instance on the fly.

That is a useful example, thank you, because it shows how to update the workers.

However, I don’t know which callback is suitable. Ideally, I would do this once and only once, which suggests on_algorithm_init. However, I believe (to be confirmed) that this hook is called after __init__() but before the training state is restored, which presumably happens via load_checkpoint() and would then override the policy_mapping_fn again.

Do you have any thoughts on this matter please?

Yes, on_algorithm_init is called precisely here:

So you can update the state of the algorithm (which includes the policy_mapping_fn) using any API that you would use otherwise. Something like this might work:


from ray.rllib.algorithms.algorithm import Algorithm
from ray.rllib.algorithms.callbacks import DefaultCallbacks


class MyCustomCallback(DefaultCallbacks):
    def __init__(self):
        super().__init__()
        self.checkpoint_path = "<checkpoint_path>"

    def on_algorithm_init(self, *, algorithm, **kwargs):
        # Load the pre-trained Algorithm and copy its policies into the
        # freshly initialized one.
        base_algo = Algorithm.from_checkpoint(self.checkpoint_path)
        policy_map = base_algo.workers.local_worker().policy_map
        for pid, policy in policy_map.items():
            # READ the API of add_policy to learn more (it also accepts
            # policy_mapping_fn, policies_to_train, etc.).
            algorithm.add_policy(pid, policy=policy, policy_mapping_fn=...)
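
The callback class (not an instance) is then registered on the config before the second tune.run call, something like:

config = config.callbacks(MyCustomCallback)
tune.run("PPO", config=config, restore=restore)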

Thank you, I have managed to get this working.


Hi @Muff2n ,

Do you have a code snippet of how you ended up setting up the tune.run portion of this? I have a similar env a / env b setup for training, but my inherited DefaultCallbacks object never has its on_algorithm_init method run in Algorithm.setup().

My current setup, using effectively the same callback code as suggested above, is the following:


        env_b = "RockPaperScissorsCsaben"
        pmf_a = select_policy

        config_a = (
            AlgorithmConfig(algo_class=self.algorithm)
            .environment("RockPaperScissors")
            .framework(self.framework)
            .rollouts(
                num_rollout_workers=0,
                num_envs_per_worker=4,
                rollout_fragment_length=10,
            )
            .training(
                train_batch_size=200,
                gamma=0.9,
            )
            .multi_agent(
                policies={
                    "always_same": PolicySpec(policy_class=AlwaysSameHeuristic),
                    "beat_last": PolicySpec(policy_class=BeatLastHeuristic),
                    "always_slow": PolicySpec(policy_class=AlwaysTooSlowPolicy),
                    "learned": PolicySpec(
                        config=AlgorithmConfig.overrides(
                            model={"use_lstm": True},
                            framework_str=self.framework,
                        )
                    ),
                },
                policy_mapping_fn=pmf_a,
                policies_to_train=["learned"],
            )
            .reporting(metrics_num_episodes_for_smoothing=200)
            # Use GPUs iff `RLLIB_NUM_GPUS` env var set to > 0.
            .resources(num_gpus=int(os.environ.get("RLLIB_NUM_GPUS", "0")))
        )

        # heuristics break everything otherwise
        config_a.checkpointing(
            export_native_model_files=False, checkpoint_trainable_policies_only=True
        )

        from ray import tune

        trial_a = tune.run(
            "PPO",
            config=config_a,
            stop={"training_iteration": 1},
            checkpoint_at_end=True,
        )
        # restore_path = trial_a.checkpoint.dir_or_data
        restore_path = trial_a.get_best_checkpoint(
            trial_a.get_best_trial(),
            mode="max",
            return_path=True,
        )

        config_b = config_a.environment(env_b).multi_agent(policy_mapping_fn=pmf_a)
        from ray.rllib.algorithms.callbacks import MultiCallbacks

        from harness.callbacks import MyCustomCallback

        config_b["callbacks"] = MultiCallbacks([MyCustomCallback(restore_path)])
        from ray import air

        tuner = tune.Tuner(
            "PPO",
            param_space=config_b,
            run_config=air.RunConfig(
                stop={"training_iteration": 1},
            ),
        )

        tuner.fit()

The callbacks are implemented in this file:

in function run_indep.

checkpoints is a list of path strings pointing to policy checkpoints that can be loaded before tune.run is called on the second environment.

    if checkpoints is not None:
      class MyCallbacks(DefaultCallbacks):
        def __init__(self):
          super().__init__()
          self.checkpoints = checkpoints
          self.policy_mapping_fn = lambda aid, episode, worker, **kwargs: aid

        def on_algorithm_init(
            self,
            *,
            algorithm: "Algorithm",
            **kwargs,
        ) -> None:
          for checkpoint in self.checkpoints:
            policy = Policy.from_checkpoint(checkpoint)
            for p_id, p in policy.items():
              algorithm.add_policy(p_id, policy=p)

          algorithm.remove_policy(DEFAULT_POLICY_ID,
                                  policy_mapping_fn=self.policy_mapping_fn,
                                  policies_to_train=list(POLICIES.keys()))

      config = config.callbacks(MyCallbacks)
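
The second tune.run call on env b then just takes this config as before, something like (the checkpoint_at_end flag mirrors the earlier runs in this thread):

tune.run("PPO", config=config, checkpoint_at_end=True)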

Thank you! This really helped. Here is a snippet of how I ended up doing it for my use case (I used a factory method to set up the restore path).


callbacks.py

from ray.rllib.algorithms.callbacks import DefaultCallbacks


def create_callback_class(restore_path):
    class MyCustomCallbackWithRestorePath(DefaultCallbacks):
        def __init__(self):
            super().__init__()
            self.restore_path = restore_path

        # Your custom methods here...
        def on_algorithm_init(self, *, algorithm, **kwargs):
            """
            Remove the 'learned' policy and replace it with the checkpointed
            one for curriculum learning.
            """
            from ray.rllib.policy.policy import Policy

            policy = Policy.from_checkpoint(self.restore_path)
            algorithm.remove_policy("learned")
            for p_id, p in policy.items():
                algorithm.add_policy(p_id, policy=p)

    return MyCustomCallbackWithRestorePath


trainer.py


restore_path = trial_a.get_best_checkpoint(
    trial_a.get_best_trial(),
    mode="max",
    return_path=True,
)

config_b = (
    config_a.environment(env_b)
    .multi_agent(policy_mapping_fn=pmf_b)
    .callbacks(create_callback_class(restore_path))
)