[rllib] Modify multi-agent env reward mid-training

Hi, I am interested in training a reward-conditioned multi-agent policy, and I would like to introduce an environment generator that decides the reward and initial configuration of the environment as training progresses. This is similar in spirit to [2012.02096] Emergent Complexity and Zero-shot Transfer via Unsupervised Environment Design.

I would like to do this in parallel: during each round of experience collection, each worker should run on an environment with a different configuration designated by the env generator.

Does RLlib support modifying the environment mid-training, or do I have to write a training routine from scratch? Alternatively, I could make the env generator part of the model/policy.


Hm, good question. I haven’t done anything like that myself but tried to check the docs and think about how I’d approach it.

I currently train my RL approaches with RLlib using tune.run(), e.g.,

import ray.tune
from ray.rllib.agents.ppo import PPOTrainer

# self.config, self.train_dir, stop_criteria, restore_path, and scheduler
# are defined elsewhere in my setup.
analysis = ray.tune.run(PPOTrainer, config=self.config, local_dir=self.train_dir, stop=stop_criteria,
                        # checkpoint every 10 iterations and at the end; keep the best 10 checkpoints
                        checkpoint_at_end=True, checkpoint_freq=10, keep_checkpoints_num=10,
                        checkpoint_score_attr='episode_reward_mean', restore=restore_path,
                        scheduler=scheduler)

Is that also what you currently do?

If I understand correctly, what you want/need is to change some setting inside your environment over time, e.g., by passing the current training step or iteration. Then your environment could implement some logic to change the reward function (and other configuration options) based on this setting.

One option that may work is to pass a custom callback to the callbacks argument of ray.tune.run().
According to the docs, tune accepts instances of ray.tune.callback.Callback.
I'm not sure what information instances of this callback have access to. If they could reach the environment instances and the training progress, the callback could be called periodically during training and modify the environment as needed.
What do you think?
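
To make the idea more concrete, here is a rough, untested sketch of what such a Tune callback could look like. Whether it can actually reach the envs on the remote rollout workers is exactly the open question; EnvUpdateCallback and the polling idea in the comments are just placeholders:

from ray.tune.callback import Callback

class EnvUpdateCallback(Callback):
    def on_trial_result(self, iteration, trials, trial, result, **kwargs):
        # `result` carries the training progress, e.g. the current iteration
        # and episode_reward_mean.
        current_iter = result["training_iteration"]
        # From here you would still need some channel to the envs on the
        # remote rollout workers (e.g. a flag/file the env polls in reset())
        # to actually push `current_iter` into them.

# Would be registered via: ray.tune.run(..., callbacks=[EnvUpdateCallback()])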

@sven1977 Do you have any other/better idea? Or tips on whether/how the callback idea would work?


Hi Stefan,

Thanks for the suggestion. I haven't tried the callback approach yet, but I did succeed in modifying my environment.

My approach is very raw: my policy consists of two actors. Actor 1, which I call the scenario generator, is only allowed to act during the first step of a rollout; what it does in that step is set the environment config. Actor 2 is a centralized policy that computes actions for every agent and is allowed to act from the second step onwards.

My last point is that I didn't use tune.run() and instead called trainer.train() in a loop, because tune.run() raised a worker exit error (error 382, if I remember correctly) when training finished. But I didn't realize that trainer.train() in a for loop doesn't automatically save my models, so I lost the model and only kept the metrics.
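
In case it helps anyone else: a minimal sketch of how such a manual loop could checkpoint explicitly (the config, env name, and interval are placeholders; the key point is that trainer.save() has to be called yourself, since train() only returns metrics):

from ray.rllib.agents.ppo import PPOTrainer

my_config = {}  # placeholder: fill in your trainer config
trainer = PPOTrainer(config=my_config, env="my_env")  # "my_env" is a placeholder
for i in range(100):
    result = trainer.train()              # one training iteration, returns metrics only
    if (i + 1) % 10 == 0:
        checkpoint_path = trainer.save()  # explicitly write a checkpoint
checkpoint_path = trainer.save()          # final checkpoint after the loop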


If you try the callback approach, let me know how it works!
I’ll also try to think of alternative solutions.


Hi Stefan, the callback approach works in my case. Thanks for the suggestion.


There is also a new “curriculum learning” API for RLlib:
In order to use it, your env must implement the TaskSettableEnv API with the set_task, get_task, and sample_tasks methods.
There is a simple example/test case here:
ray/rllib/examples/curriculum_learning.py, which illustrates how you can use this.
Using callbacks suggested by @stefanbschneider is also a completely valid way to do so.
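
A minimal sketch of an env implementing this API (the integer task here just stands in for a difficulty level or reward setting, and the usual reset()/step() of a gym.Env are omitted; adjust the import path to your Ray version if needed):

import random

from ray.rllib.env.apis.task_settable_env import TaskSettableEnv

class MyCurriculumEnv(TaskSettableEnv):
    def __init__(self, config=None):
        self.task = 1  # e.g. difficulty level / reward setting

    def sample_tasks(self, n_tasks):
        # Return `n_tasks` task specs to choose from.
        return [random.randint(1, 5) for _ in range(n_tasks)]

    def get_task(self):
        return self.task

    def set_task(self, task):
        # Switch this env (on its worker) to the given task.
        self.task = task

    # ... plus the usual reset()/step(), which can depend on self.task.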


Would you be able to expand a bit on how you implemented this with the callbacks? I was thinking about this problem as well, and there are a few pain points I’ve come up against:

  • How are you getting gradient information to the builder agent, since all of its actions (e.g. placing walls) should happen in the reset function of the environment? All the callback override hooks fire at on_episode_start/end/step/etc.

  • How are you dealing with the multiple action spaces for the different networks? If the builder network outputs whether to turn a cell in the map on or off, it needs a large action space, whereas a game-playing network will have a standard action space of e.g. Up, Down, Left, Right, Noop. (See the rough config sketch below for what I imagine this would require.)
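
For context, here is roughly what I imagine would be needed: separate policies with their own action spaces in the multiagent config. This is only a sketch; the spaces, agent IDs, and sizes are made up for a grid of width × height cells:

import gym

width, height = 10, 10
obs_space = gym.spaces.Box(low=0.0, high=1.0, shape=(width * height,))

multiagent_config = {
    "multiagent": {
        "policies": {
            # (policy_cls, obs_space, act_space, extra config); None = default class
            "builder": (None, obs_space, gym.spaces.Discrete(width * height * 2), {}),
            "player": (None, obs_space, gym.spaces.Discrete(5), {}),  # Up/Down/Left/Right/Noop
        },
        "policy_mapping_fn": lambda agent_id: "builder" if agent_id == "builder" else "player",
    },
}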

Hi,

Below is my callback. For now I don't actually learn the builder (which I call the generator); I just let it sample randomly from an acceptable space of env configurations and output a dict of lists, which I call gen_action_dict.

from typing import Dict
from ray.rllib.agents.callbacks import DefaultCallbacks
from ray.rllib.env import BaseEnv
from ray.rllib.evaluation import MultiAgentEpisode, RolloutWorker
from ray.rllib.policy import Policy

class GeneratorCallbacks(DefaultCallbacks):
    def on_episode_start(self, *, worker: "RolloutWorker", base_env: BaseEnv,
                         policies: Dict[str, Policy], episode: MultiAgentEpisode,
                         env_index: int, **kwargs) -> None:
        # A BaseEnv can wrap several sub-environments; configure each of them.
        envs = base_env.get_unwrapped()
        policy = policies["default_policy"]
        # policy.generator() samples one env config per sub-env (a dict of lists).
        gen_action_dict = policy.generator(len(envs))
        for i, env in enumerate(envs):
            gen_config = {key: gen_action_dict[key][i] for key in gen_action_dict}
            env.set_config(gen_config)

If you want to actually train the builder, you will need to store the builder's actions, which are just the initial states of the environment, and its rewards, which are whatever rewards you define (such as the training losses of the player networks). You could store the builder's interactions in a separate dataset and draw on it when you train the builder.
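
One way this could look is a per-worker buffer of (env config, episode return) pairs filled in on_episode_end. This is only a sketch: env.get_config() is assumed to mirror the set_config() above, it could just as well be an extra method on the GeneratorCallbacks class, and shipping the buffer back to the driver for the builder update is left open:

from ray.rllib.agents.callbacks import DefaultCallbacks

class BuilderBufferCallbacks(DefaultCallbacks):
    def __init__(self):
        super().__init__()
        # Collected builder "transitions": (chosen env config, episode return).
        self.builder_buffer = []

    def on_episode_end(self, *, worker, base_env, policies, episode,
                       env_index, **kwargs):
        # The episode that just ended belongs to the sub-env at `env_index`.
        env = base_env.get_unwrapped()[env_index]
        # get_config() is assumed to return the config set via set_config() above.
        self.builder_buffer.append((env.get_config(), episode.total_reward))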

For me the ambiguous part is what the builder's reward should be. Ideally we are trying to train agents that do well in many reward settings, but we know that with some rewards the players will achieve low loss while with others they will have high loss, so it seems easy for the builder to exploit this.

I think in the original paper they use an RNN to define the environment, so you place only one tile at a time, conditioned on the previously placed tiles. You train the player and the builder separately so their actions don't overlap. In my case I work in a continuous space, so I just adjust some parameters rather than individual tiles.
