Ensemble Learner with rule-based policies


We’re working on a project where we would like to write several low-level rule-based policies and feed them into a top-level reinforcement learning policy that decides on an action. We’ve implemented Policy classes for each of the low-level policies; each one decides on an action based on the observation it receives.

We’ve also implemented an environment wrapper that alternates between stepping the low-level and top-level policies in a similar way to the example in https://github.com/ray-project/ray/blob/master/rllib/examples/env/windy_maze_env.py.

We would like to use one of RLlib’s provided trainers, e.g. IMPALA, to train the top-level policy, but are unsure how to set this up in tune.run() so that it works alongside our rule-based policies. Is this possible with any of the trainers, or would we have to implement our own Trainer class?

Later on we’d like to have several of these hierarchical agents acting in our environment, so we would also appreciate some advice on how to set that up.

Thanks for your help!

Hey @wumpus, thanks for the question. I think what you would like to do is quite similar to our ray/rllib/examples/hierarchical_training.py example script (using the HierarchicalWindyMazeEnv).

Am I right that in your case, though, the low-level policies are already set and don’t learn anymore and you now only want to train the higher level policy?

If yes, you could probably simply build your env analogous to the WindyMazeEnv: publish “low_level_agent_…” IDs when low-level steps are required, and the “high_level_agent” ID when the learning (IMPALA?) high-level policy has to output a new choice of low-level policy. The high-level action space would then be Discrete(n), with n equal to the number of low-level policies/options.
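To make the alternation concrete, here is a minimal sketch in plain Python. It only mimics the multi-agent dict convention (observations, rewards and dones keyed by agent ID, plus the "__all__" done flag) and does not subclass anything from RLlib. The class name ToyHierarchicalEnv, the 1-D position task, and the assumption that the low-level agent ID ends in the index chosen by the high-level policy are all made up for illustration.

```python
class ToyHierarchicalEnv:
    """Hypothetical stand-in for a hierarchical multi-agent env (not an RLlib class).

    The high-level agent picks a low-level policy index; that low-level
    agent then acts for a fixed number of primitive steps before control
    returns to the high-level agent.
    """

    def __init__(self, num_low_level_policies=2, low_level_steps=3, goal=5):
        self.num_low_level_policies = num_low_level_policies
        self.low_level_steps = low_level_steps
        self.goal = goal

    def reset(self):
        self.pos = 0
        self.steps_left = 0
        self.current_low = None
        # Only the high-level agent is asked to act first.
        return {"high_level_agent": self.pos}

    def step(self, action_dict):
        assert len(action_dict) == 1, "exactly one agent acts per step"
        if "high_level_agent" in action_dict:
            return self._high_level_step(action_dict["high_level_agent"])
        return self._low_level_step(next(iter(action_dict.values())))

    def _high_level_step(self, choice):
        # Assumption: the agent ID encodes which low-level policy was chosen.
        self.current_low = f"low_level_agent_{choice}"
        self.steps_left = self.low_level_steps
        obs = {self.current_low: self.pos}
        rew = {self.current_low: 0.0}
        return obs, rew, {"__all__": False}, {}

    def _low_level_step(self, primitive_action):
        # Toy dynamics: the primitive action is -1 or +1 on a line.
        self.pos += primitive_action
        self.steps_left -= 1
        finished = self.pos >= self.goal
        if finished or self.steps_left == 0:
            # Hand control (and the reward) back to the high-level agent.
            obs = {"high_level_agent": self.pos}
            rew = {"high_level_agent": 1.0 if finished else 0.0}
        else:
            obs = {self.current_low: self.pos}
            rew = {self.current_low: 0.0}
        return obs, rew, {"__all__": finished}, {}
```

A real env would follow the WindyMazeEnv example more closely (including its observation and action spaces); the point here is only which agent ID shows up in the returned dicts at each step.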

Your low-level policies would have to be included in the “multiagent” sub-config and “tagged” as non-learning:

    "multiagent": {
        "policies": {
            "high_level_policy": (ImpalaTFPolicy, [obs-space], Discrete(n == num low-level policies), {}),
            "low_level_policy_1": ([some fixed policy], [obs-space], [primitive action-space], {}),
        },
        "policies_to_train": ["high_level_policy"],  # <- makes sure the low-level policies are never trained/updated
        "policy_mapping_fn": [use a similar one as in the windy-maze example script],
    },
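For the mapping function, something along these lines might work. It is a rough sketch, assuming the env publishes low-level agent IDs of the form "low_level_agent_<index>" and that the policies are named "low_level_policy_<index>" as in the config above; that naming scheme is an assumption, not something RLlib prescribes.

```python
def policy_mapping_fn(agent_id, *args, **kwargs):
    """Map an agent ID published by the env to a policy ID from the config.

    Extra positional/keyword arguments are accepted and ignored because
    the arguments RLlib passes to this function differ between versions.
    """
    if agent_id.startswith("low_level_agent_"):
        # Assumed convention: the trailing index selects the fixed policy.
        index = agent_id.rsplit("_", 1)[1]
        return f"low_level_policy_{index}"
    return "high_level_policy"
```

With this, "low_level_agent_1" is routed to "low_level_policy_1", and anything else falls through to the trainable "high_level_policy".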