[rllib] Modify multi-agent env reward mid-training

Hi, I am interested in training a reward-conditioned multi-agent policy, and I would like to introduce an environment generator that decides the reward and initial configuration of the environment as training progresses. This is similar in spirit to [2012.02096] Emergent Complexity and Zero-shot Transfer via Unsupervised Environment Design.

I would like to do this in parallel, meaning that during each round of experience collection, each worker should run an environment with a different configuration designated by the env generator.

Does RLlib support modifying the environment mid-training, or do I have to write a training routine from scratch? Alternatively, I could make the env generator part of the model/policy.

Hm, good question. I haven’t done anything like that myself, but I checked the docs and thought about how I’d approach it.

I currently train my RL approaches with RLlib using tune.run(), e.g.,

```python
analysis = ray.tune.run(
    PPOTrainer,
    config=self.config,
    local_dir=self.train_dir,
    stop=stop_criteria,
    # checkpoint every 10 iterations and at the end; keep the best 10 checkpoints
    checkpoint_at_end=True,
    checkpoint_freq=10,
    keep_checkpoints_num=10,
    checkpoint_score_attr='episode_reward_mean',
    restore=restore_path,
    scheduler=scheduler,
)
```

Is that also what you currently do?

If I understand correctly, what you want/need is to change some setting inside your environment over time, e.g., by passing in the current training step or iteration. Your environment could then implement some logic to change the reward function (and other configuration options) based on this setting.
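
Something along these lines, just as a sketch (the class, the setter name `set_training_progress`, and the toy reward logic are all made up for illustration):

```python
import gym
import numpy as np


class AdaptiveRewardEnv(gym.Env):
    """Toy env whose reward depends on an externally supplied setting."""

    def __init__(self, env_config=None):
        self.observation_space = gym.spaces.Box(-1.0, 1.0, shape=(4,), dtype=np.float32)
        self.action_space = gym.spaces.Discrete(2)
        self.reward_scale = 1.0

    def set_training_progress(self, iteration):
        # Called from outside to reconfigure the reward as training progresses.
        self.reward_scale = 1.0 + 0.01 * iteration

    def reset(self):
        return self.observation_space.sample()

    def step(self, action):
        obs = self.observation_space.sample()
        reward = self.reward_scale * float(action)  # placeholder reward logic
        done = True
        return obs, reward, done, {}
```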

One option that may work is to pass a custom callback to the callbacks argument of ray.tune.run().
According to the docs, tune accepts instances of ray.tune.callback.Callback.
I’m not sure what information instances of this callback have access to. If they had access to the environment instances and the training progress, the callback could be invoked periodically during training and modify the environment the way you want.
What do you think?
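
For concreteness, a minimal sketch of such a callback (assuming Ray 1.x; as far as I can tell it only receives trial-level results, not the environment objects themselves, so the actual env modification would still need another hook):

```python
from ray.tune.callback import Callback


class CurriculumCallback(Callback):
    """Sketch: reacts to the per-iteration results that tune reports."""

    def on_trial_result(self, iteration, trials, trial, result, **info):
        # `result` is the dict produced by the trainer for this iteration,
        # e.g. result["training_iteration"] or result["episode_reward_mean"].
        # This is where you could decide that the env config should change.
        print(f"{trial}: iteration {result['training_iteration']}, "
              f"mean reward {result['episode_reward_mean']:.2f}")
```

It would be passed in via ray.tune.run(..., callbacks=[CurriculumCallback()]).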

@sven1977 Do you have any other/better idea? Or tips on whether/how the callback idea would work?

Hi Stefan,

Thanks for the suggestion! I haven’t tried the callback approach yet, but I did succeed at modifying my environment.

My approach is very raw:
My policy consists of two actors. Actor 1, called the scenario generator, is only allowed to play during the first step of a rollout; all it does in that step is set the environment config. Actor 2 is a centralized policy that computes actions for every agent and is allowed to play from the second step of the environment onwards.
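
In case it helps others, this is roughly how that idea can be expressed as a MultiAgentEnv (a simplified sketch rather than my actual code; the agent names, spaces, horizon, and reward logic are placeholders):

```python
import gym
from ray.rllib.env.multi_agent_env import MultiAgentEnv


class ScenarioConditionedEnv(MultiAgentEnv):
    """Sketch: a generator agent acts on step 1, a controller afterwards."""

    def __init__(self, env_config=None):
        self.t = 0
        self.horizon = 100
        self.reward_weight = 1.0
        self.obs_space = gym.spaces.Box(-1.0, 1.0, shape=(4,))

    def reset(self):
        self.t = 0
        # Only the scenario generator receives an observation on the first step.
        return {"scenario_generator": self.obs_space.sample()}

    def step(self, action_dict):
        self.t += 1
        if "scenario_generator" in action_dict:
            # The generator's action fixes the env config / reward for the
            # rest of the episode (toy example: it just picks a reward weight).
            self.reward_weight = float(action_dict["scenario_generator"]) + 1.0
            return (
                {"controller": self.obs_space.sample()},   # controller acts next
                {"scenario_generator": 0.0},               # generator reward omitted
                {"__all__": False},
                {},
            )
        # All later steps: only the centralized controller acts.
        reward = self.reward_weight * float(action_dict["controller"])
        return (
            {"controller": self.obs_space.sample()},
            {"controller": reward},
            {"__all__": self.t >= self.horizon},
            {},
        )
```

The two actors are then mapped to separate policies via the policy_mapping_fn in the multiagent part of the config.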

One last point: I didn’t use tune.run() and instead called trainer.train() directly, because I had an issue with tune.run() raising a worker exit error (error 382, if I remember correctly) when training finished. But I didn’t realize that calling trainer.train() in a for loop doesn’t automatically save my models, so I lost the model and only kept the metrics.
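
For anyone going the same route: with the plain trainer API you have to checkpoint manually, roughly like this (a sketch assuming a Ray 1.x PPOTrainer; `config` and `num_iterations` stand in for your own values):

```python
from ray.rllib.agents.ppo import PPOTrainer

trainer = PPOTrainer(config=config)
for i in range(num_iterations):
    result = trainer.train()
    if (i + 1) % 10 == 0:
        # trainer.save() writes a checkpoint and returns its path.
        checkpoint_path = trainer.save()
        print(f"iteration {i + 1}: checkpoint saved to {checkpoint_path}")
```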


If you try the callback approach, let me know how it works!
I’ll also try to think of alternative solutions.