[RLlib] batch size interpretation when training multiple policies

What is the meaning of hyperparameters like train_batch_size when multiple policies are being trained at once in a multi-agent RL setup? It would seem that depending on the number of agents per policy, experience is going to be collected at different rates for different policies.

For example, take the cars and traffic lights in “Scaling Multi-Agent Reinforcement Learning” (The Berkeley Artificial Intelligence Research Blog). Imagine there are 10x as many cars under car policy 1 as under car policy 2, so 10x as many observations/actions will be generated for training car policy 1 in a given rollout.

How does RLlib handle this? Does it defer training each policy network until it has sufficient data? Or, does it train each network with smaller batches, triggered on some aggregate measure of collected experience across all policies? (e.g. the total number of car policy 1 and car policy 2 experiences, irrespective of how many of each type)

Hey @andrew-rosenfeld-ts , thanks for asking this question!

The batch sizes are measured - by default - in env steps, where one env step corresponds to one call of the env’s step() method. In the multi-agent (and multi-policy) case, as well as when agents step at different frequencies, this may mean that a single env step contains a step for only one of the policies, while the other policy does not step at all.

You can change the unit of counting by setting “config->multiagent->count_steps_by” to “agent_steps” (default is “env_steps”). When using “agent_steps”, each individual agent’s step counts as one towards the batch size.

Also, the same counting unit applies to “rollout_fragment_length”.
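
For reference, a minimal sketch of where that key lives in the (old-style) config dict - the surrounding values and the policy setup are illustrative placeholders, not recommendations:

    config = {
        # Both of these are counted in the unit selected by "count_steps_by" below.
        "train_batch_size": 4000,
        "rollout_fragment_length": 200,
        "multiagent": {
            # ... your "policies" and "policy_mapping_fn" go here ...
            # "env_steps" (default): one env.step() call counts as one step.
            # "agent_steps": each individual agent's step counts as one step.
            "count_steps_by": "agent_steps",
        },
    }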

Thanks @sven1977 , this is very helpful! I am surprised I hadn’t seen this before, but it seems it was recently introduced.

Just to be clear - if using “agent_steps”, does this mean that policies would (likely) no longer train in lockstep with each other? If so, what’s the interpretation of metrics like training iteration in Tune under this setup?

@sven1977 - Just following up on this one: I think you’re saying that agent_steps just provides a different trigger for terminating experience collection, almost like a sub-classification of truncate_episodes. As I understand it, collection stops once a certain number of agent steps (summed across all policies) have been collected, rather than a certain number of environment steps.

I don’t think this addresses anything with regard to multiple policies, which is what I am interested in. I would be interested in a setting that somehow ensures that every policy has seen a minimum amount of training data before an update happens.

A couple of different ways I could imagine that happening:

  • experience collection continues until every policy has a full batch_size worth of agent_steps collected. If some policies have more, that’s fine, but at least all policies have the minimum. I think one can approximate this right now by scaling up batch_size by the inverse of the fraction of the time the least frequent policy is acting. E.g. if the least frequent policy accounts for only 20% of the agent steps, then you need to collect 5x as many agent steps for it to have a full batch (see the sketch right after this list).
  • somehow policies don’t train in lockstep, but train upon getting a full batch of data. So one policy that sees 80% of the agent steps will update 4x as often as one that sees 20% of the agent steps. Not sure if this is possible right now.
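
A quick sketch of that scaling heuristic from the first bullet (the fractions, batch size, and helper function are made-up for illustration):

    # Hypothetical helper: scale up the configured batch size so that even the
    # least frequently acting policy collects at least `target_per_policy` agent steps.
    def scaled_batch_size(target_per_policy, policy_step_fractions):
        min_fraction = min(policy_step_fractions.values())
        return int(target_per_policy / min_fraction)

    # car_policy_2 only accounts for 20% of all agent steps ...
    fractions = {"car_policy_1": 0.8, "car_policy_2": 0.2}
    # ... so to give it a full batch of 4000 agent steps, collect 5x as much in total:
    print(scaled_batch_size(4000, fractions))  # -> 20000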

Is my understanding correct here?

@andrew-rosenfeld-ts
I guess your first bullet point is correct; here is a small illustration of my personal understanding:

|time              |  0  |  1  |  2  |  3  |  4  |  5  |  6  |  7  |  8  |  9  | ...
|-----------------------------------------------------------------------------------
|stepping agent(s) |  1  |  1  |  1  |  1  | 1, 2|  1  |  1  |  1  |  1  | 1, 2| ...

batch_size in case of count_steps_by=env_steps: 10
batch_size in case of count_steps_by=agent_steps: 12
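
A tiny script that reproduces those two counts from the stepping pattern in the table:

    # Agents stepping at each of the 10 env steps above:
    # agent 1 steps every time, agent 2 only at t=4 and t=9.
    stepping_agents = [
        [1], [1], [1], [1], [1, 2],
        [1], [1], [1], [1], [1, 2],
    ]

    env_steps = len(stepping_agents)                              # 10
    agent_steps = sum(len(agents) for agents in stepping_agents)  # 12
    print(env_steps, agent_steps)  # -> 10 12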

I don’t agree with your second bullet point, because I think policies are trained “in lockstep” even if one of the policies sees 80% of the steps and the other one only 20%. IMO only the amount of training data per policy differs in this case.
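
In other words (a pure-Python sketch of my understanding, not of RLlib internals): each training iteration splits the one collected batch by policy and updates every policy once, just on differently sized sub-batches.

    # Sketch of one "lockstep" training iteration: the collected batch is a
    # list of (policy_id, sample) pairs, with an 80/20 split between policies.
    collected = [("car_policy_1", s) for s in range(80)] + \
                [("car_policy_2", s) for s in range(20)]

    # Split the single batch by policy ...
    per_policy = {}
    for policy_id, sample in collected:
        per_policy.setdefault(policy_id, []).append(sample)

    # ... and update *every* policy once this iteration, each on its own sub-batch.
    for policy_id, samples in per_policy.items():
        print(f"update {policy_id} on {len(samples)} samples")
    # -> update car_policy_1 on 80 samples
    # -> update car_policy_2 on 20 samples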