What is the meaning of hyperparameters like
train_batch_size when multiple policies are being trained at once in a multi-agent RL setup? It would seem that, depending on how many agents are assigned to each policy, experience will be collected at different rates for different policies.
For example, consider the cars and traffic lights in "Scaling Multi-Agent Reinforcement Learning" (Berkeley Artificial Intelligence Research blog). Imagine there are 10x as many cars under car policy 1 as under car policy 2, so 10x as many observations/actions will be generated for training car policy 1 in a given rollout.
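To make the rate difference concrete, here is a small back-of-the-envelope sketch. The agent counts and batch size are assumed for illustration; this is plain arithmetic, not RLlib code:

```python
# Hypothetical illustration (not RLlib internals): per-policy sample
# accumulation when agents are unevenly split across policies.
train_batch_size = 4000  # samples a policy needs before one training update
agents_per_policy = {"car_policy_1": 10, "car_policy_2": 1}

# Each env step yields one (obs, action, reward) sample per agent, so a
# policy accumulates samples at a rate equal to its agent count.
steps_until_batch = {
    pid: -(-train_batch_size // n)  # ceiling division
    for pid, n in agents_per_policy.items()
}
print(steps_until_batch)  # {'car_policy_1': 400, 'car_policy_2': 4000}
```

Under these assumed numbers, car policy 1 fills a batch 10x sooner than car policy 2, which is exactly the asymmetry the question is about.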
How does RLlib handle this? Does it defer training each policy network until that policy has collected sufficient data? Or does it train each network on smaller batches, triggered by some aggregate measure of collected experience across all policies (e.g. the total number of car policy 1 and car policy 2 experiences, irrespective of how many of each type)?
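For concreteness, the second hypothesis (an aggregate trigger) can be sketched as follows. This is purely illustrative arithmetic under the same assumed agent counts, not a claim about RLlib's actual behavior:

```python
# Purely illustrative (not RLlib internals): if training were triggered
# when the TOTAL sample count across policies reaches train_batch_size,
# each policy would train on a batch proportional to its agent count.
train_batch_size = 4000
rates = {"car_policy_1": 10, "car_policy_2": 1}  # samples per env step

total_rate = sum(rates.values())                   # 11 samples per step
trigger_step = -(-train_batch_size // total_rate)  # ceil(4000 / 11) = 364
per_policy_batch = {pid: n * trigger_step for pid, n in rates.items()}
print(per_policy_batch)  # {'car_policy_1': 3640, 'car_policy_2': 364}
```

Under this scheme car policy 2 would be updated on a much smaller batch than car policy 1, which is why the distinction between the two triggering strategies matters.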