Hi, I am interested in training a reward-conditioned multi-agent policy, and I would like to introduce an environment generator that decides the reward and initial configuration of the environment as training progresses. This is similar in spirit to [2012.02096] Emergent Complexity and Zero-shot Transfer via Unsupervised Environment Design.
I would like to do this in parallel: during each round of experience collection, I would prefer each worker to run an environment with a different configuration designated by the env generator.
Does RLlib support modifying the environment mid-training, or do I have to write a training routine from scratch? Alternatively, I could make the env generator part of the model/policy.
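For context, here is a pure-Python sketch of the kind of generator I have in mind. All names here (`EnvGenerator`, `EnvConfig`, `sample_config`, the curriculum rule) are my own placeholders, not RLlib APIs; the open question is how to hook something like this into each rollout worker:

```python
import random
from dataclasses import dataclass


@dataclass
class EnvConfig:
    reward_weights: dict  # per-agent reward shaping weights
    initial_state: tuple  # initial configuration of the environment


class EnvGenerator:
    """Hypothetical generator: proposes a fresh env config per rollout
    worker each round, widening the task distribution as training
    progresses (in the spirit of unsupervised environment design)."""

    def __init__(self, seed: int = 0):
        self.seed = seed
        self.round = 0
        self.difficulty = 0.0  # curriculum knob, grows with progress

    def update(self, mean_episode_reward: float) -> None:
        """Called once per training iteration with the latest results."""
        self.round += 1
        if mean_episode_reward > 0.5:  # toy rule: widen once agents do well
            self.difficulty = min(1.0, self.difficulty + 0.1)

    def sample_config(self, worker_index: int) -> EnvConfig:
        # Deterministic per (seed, round, worker), so every parallel
        # worker gets its own configuration within the same round.
        rng = random.Random(self.seed * 100003 + self.round * 1009 + worker_index)
        weights = {f"agent_{i}": round(rng.uniform(0.0, 1.0), 3) for i in range(2)}
        size = int(1 + 9 * self.difficulty)
        init = tuple(rng.randrange(size + 1) for _ in range(2))
        return EnvConfig(reward_weights=weights, initial_state=init)
```

My guess is that `sample_config` could be called from the registered env creator (since the `EnvContext` it receives carries a `worker_index`) and `update` from an `on_train_result` callback, but I'm not sure that's the intended way to reconfigure environments mid-training.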