For context: I am doing multi-agent learning with a PPO config, and the number of agents varies during training. I am also just trying to understand RLlib a little better. When I start training (manually, without Tune at the moment), I can see that two environments are created and that both of them are adding agents during training. Why are there two environments? Also, given that my simulation takes up a lot of resources, how would I control how many environment copies get created and keep resource use down?
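For reference, here is a minimal sketch of the kind of setup I mean (a config fragment, not my actual code; `"my_env"` and the policy names are placeholders, and I am assuming the newer `PPOConfig` builder API from recent Ray versions):

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    # Placeholder for my registered multi-agent environment.
    .environment("my_env")
    .multi_agent(
        # Placeholder policy spec; in my real setup agents are added over time.
        policies={"shared_policy"},
        policy_mapping_fn=lambda agent_id, *args, **kwargs: "shared_policy",
    )
    # Each env runner holds its own copy of the environment, so I assume
    # this setting influences how many environment instances exist.
    .env_runners(num_env_runners=1)
)

algo = config.build()  # training manually, without Tune
result = algo.train()
```

Is the number of environment copies I observe tied to settings like `num_env_runners`, and is lowering that the right lever for an expensive simulation?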