Individual training regimes in RLlib Multi-Agent

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I have a multi-agent system with 3 policies (let’s call them M, X, and Y). Every two timesteps, M acts stochastically first; in the following timestep, depending on M’s action, either X or Y is selected. I want X (and Y) to be trained on a specific amount of experience.
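For reference, here is a rough sketch of what I mean, written as an old-style config dict. The env name and the assumption that agent IDs match the policy names are placeholders, not my actual code:

```python
from ray.rllib.policy.policy import PolicySpec

config = {
    "env": "my_turn_based_env",  # placeholder env name
    "multiagent": {
        "policies": {
            "M": PolicySpec(),
            "X": PolicySpec(),
            "Y": PolicySpec(),
        },
        # Assumes the env emits agent IDs "M", "X", "Y" that map 1:1 to the policies.
        "policy_mapping_fn": lambda agent_id, *args, **kwargs: agent_id,
    },
}
```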

Currently, if I set train_batch_size to e.g. 4, X may be trained on a batch of at most 4. If X acted 20% of the time, it gets 20% of the experience. However, since the rate at which X acts is stochastic, there is no way to guarantee a precise batch size for X when training it.
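To see how the collected experience actually splits across policies, I believe a callback along these lines should work (the class name is mine; it inspects each rollout fragment as it is collected on a worker):

```python
from ray.rllib.agents.callbacks import DefaultCallbacks  # newer Ray: ray.rllib.algorithms.callbacks


class PolicyBatchSizeLogger(DefaultCallbacks):
    """Print how each collected rollout fragment splits across policies."""

    def on_sample_end(self, *, worker, samples, **kwargs):
        # In multi-agent setups, `samples` is typically a MultiAgentBatch whose
        # `policy_batches` dict maps policy IDs to their individual SampleBatches.
        if hasattr(samples, "policy_batches"):
            split = {pid: b.count for pid, b in samples.policy_batches.items()}
            print(f"env steps: {samples.count}, per-policy agent steps: {split}")
```

(Enabled via `config["callbacks"] = PolicyBatchSizeLogger`.)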

I should also mention (although I am not sure if this is relevant) that M is not a trainable policy.
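In config terms, M is simply left out of the trained policies (continuing the sketch above):

```python
# Only X and Y are trained; M just acts.
config["multiagent"]["policies_to_train"] = ["X", "Y"]
```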

Moreover, since only one agent acts at each timestep, setting count_steps_by to either agent_steps or env_steps yields exactly the same count.
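That is, in the sketch above either option gives the same numbers in my case:

```python
# Exactly one agent acts per env step, so both settings count identically here.
config["multiagent"]["count_steps_by"] = "env_steps"  # or "agent_steps"
```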

Finally, I rated the impact as medium because you can get an expected batch size for e.g. X by estimating how often it acts on average and scaling train_batch_size up accordingly. However, that only holds in expectation over a large number of batches; a guaranteed exact batch size for X would be preferable.
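The workaround I mean looks roughly like this (numbers purely illustrative):

```python
# Target ~1000 X-samples per train batch, assuming X acts in ~20% of timesteps.
desired_x_samples = 1000
expected_x_rate = 0.2
config["train_batch_size"] = int(desired_x_samples / expected_x_rate)  # 5000; only correct in expectation
```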

Is there a way to train the agents asynchronously (e.g. each of them having its own counter and its own experience)? Or can I perhaps change the way steps are counted, for example counting them such that training is only triggered once X and Y each have at least train_batch_size experiences?
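The "own counter" part is easy enough to sketch with a callback (class name is mine; note each rollout worker would hold its own counter). What I don't see is how to hook such counters into the decision of when a training step actually happens, short of writing a custom training loop/execution plan:

```python
from ray.rllib.agents.callbacks import DefaultCallbacks  # newer Ray: ray.rllib.algorithms.callbacks


class PerPolicyStepCounter(DefaultCallbacks):
    """Keep a running per-policy count of sampled agent steps on this worker."""

    def __init__(self):
        super().__init__()
        self.steps_per_policy = {}

    def on_sample_end(self, *, worker, samples, **kwargs):
        # Accumulate how much experience each policy has gathered so far.
        if hasattr(samples, "policy_batches"):
            for pid, batch in samples.policy_batches.items():
                self.steps_per_policy[pid] = self.steps_per_policy.get(pid, 0) + batch.count
```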

I have a similar question about asynchronous training of agents. In a similar setup using PPO, I am not aware of a way to trace back the number of environment interactions a particular agent (e.g. X) had. Here too, I would like X to be trained on a specific amount of experience, and I would like insight into (and control over) this number in order to investigate how many interactions are required for convergence.

Currently, I estimate this number from the number of agent interactions during evaluation, but I’m unsure whether the distribution of agent interactions during training and evaluation is the same.
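A sketch of what I have in mind instead (class and metric names are mine, not something I have verified end to end): logging per-agent interaction counts directly from the training rollouts via a callback, so the numbers come from the training distribution rather than from evaluation:

```python
from ray.rllib.agents.callbacks import DefaultCallbacks  # newer Ray: ray.rllib.algorithms.callbacks


class AgentInteractionLogger(DefaultCallbacks):
    """Log, per episode, how many environment interactions each agent had during training."""

    def on_postprocess_trajectory(self, *, worker, episode, agent_id, policy_id,
                                  policies, postprocessed_batch, original_batches, **kwargs):
        # Called when an agent's trajectory is postprocessed (typically once per
        # agent per episode); the batch length is the number of steps that agent took.
        key = f"num_interactions_{agent_id}"
        episode.custom_metrics[key] = (
            episode.custom_metrics.get(key, 0) + postprocessed_batch.count
        )
```

The per-episode counts should then show up (averaged) under custom_metrics in the training results.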