Replay_mode = lockstep for minibatch SGD

Hi!

It seems that the replay_mode parameter isn’t taken into account for RL agents that don’t use replay buffers. In particular, policies learning via minibatch SGD, such as PPO, won’t group the agent steps coming from the same env step into the same minibatch.

Would you be interested in a PR modifying the do_minibatch_sgd and minibatches functions to take this parameter into account? Do you already have an idea of how it should be implemented?
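
For illustration only, here is a minimal sketch of the kind of grouping I have in mind, written against plain Python lists of dicts rather than RLlib’s SampleBatch. The env_step_id key and the function name are hypothetical; an actual implementation would have to hook into do_minibatch_sgd / minibatches instead:

```python
from collections import defaultdict
import random


def minibatches_sketch(rows, minibatch_size, replay_mode="independent"):
    """Yield minibatches of agent rows. `minibatch_size` counts agent steps
    in "independent" mode and env steps in "lockstep" mode."""
    if replay_mode == "independent":
        # Current behavior: shuffle individual agent rows, regardless of
        # which env step they came from.
        shuffled = list(rows)
        random.shuffle(shuffled)
        for i in range(0, len(shuffled), minibatch_size):
            yield shuffled[i:i + minibatch_size]
    else:  # "lockstep": keep all agent rows from the same env step together
        by_env_step = defaultdict(list)
        for row in rows:
            by_env_step[row["env_step_id"]].append(row)
        env_steps = list(by_env_step.values())
        random.shuffle(env_steps)
        for i in range(0, len(env_steps), minibatch_size):
            # Flatten the selected env steps back into agent rows.
            yield [r for step in env_steps[i:i + minibatch_size] for r in step]
```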

Thanks!
Thomas

Note: I tested with Ray 1.0.1, but looking at the files, the behavior doesn’t seem any different on the master branch.

Hey @thomaslecat, interesting find. You are right, this only affects sampling from a buffer.
On the other hand, for PPO, wouldn’t the train batch be “locked” anyway, b/c it’s always coming directly from the env rollouts? So you would always have the “natural” agent-ratios in that train batch, no?

Hi @sven1977, thanks for your reply! Indeed, the train batch comes directly from the env rollouts, so it’s “locked”, but it is then divided into smaller minibatches randomly, so these minibatches aren’t “locked” anymore.

For context: I am trying to implement differentiable communication channels between agents using Ray’s built-in agents and custom models, hence my questions and tickets around the “lockstep” mode :grinning_face_with_smiling_eyes:

If we consider PPO on an env with 3 agents, train_batch_size: 30, and sgd_minibatch_size: 5, we could imagine the following behaviors:

  • In replay_mode: independent (current behavior): we sample 10 env steps, get a train batch of 30 agent steps, then each minibatch is made of 5 randomly sampled agent steps.
  • In replay_mode: lockstep: we sample 30 env steps, get a train batch of 90 agent steps, then each minibatch is made of 5 randomly sampled env steps totaling 15 agent steps (see the toy run after this list).
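
To make the bookkeeping concrete, here is a toy run of the sketch from my first post (again with hypothetical keys, just to check the counts):

```python
# "independent" accounting: 10 env steps x 3 agents = 30 agent rows.
rows_independent = [{"env_step_id": t, "agent_id": a}
                    for t in range(10) for a in range(3)]
# "lockstep" accounting: 30 env steps x 3 agents = 90 agent rows.
rows_lockstep = [{"env_step_id": t, "agent_id": a}
                 for t in range(30) for a in range(3)]

mb_ind = next(minibatches_sketch(rows_independent, 5, "independent"))
mb_lock = next(minibatches_sketch(rows_lockstep, 5, "lockstep"))
print(len(mb_ind))   # 5 agent steps, possibly from different env steps
print(len(mb_lock))  # 15 agent steps: 5 whole env steps x 3 agents
```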