Replay_mode = lockstep for minibatch SGD

Hi!

It seems that the replay_mode parameter isn’t taken into account for RL agents that don’t use replay buffers. In particular, policies learning via minibatch SGD, such as PPO, won’t group the agent steps coming from the same env step into the same minibatch.

Would you be interested in a PR modifying the do_minibatch_sgd and minibatches functions to take this parameter into account? Do you already have an idea of how it should be implemented?
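
For illustration only, here is a minimal sketch of the kind of grouping I have in mind, written against plain Python lists of dicts rather than RLlib’s SampleBatch. The env_step_id key and the function name are hypothetical; an actual implementation would have to hook into do_minibatch_sgd / minibatches instead:

```python
from collections import defaultdict
import random


def minibatches_sketch(rows, minibatch_size, replay_mode="independent"):
    """Yield minibatches of agent rows. `minibatch_size` counts agent steps
    in "independent" mode and env steps in "lockstep" mode."""
    if replay_mode == "independent":
        # Current behavior: shuffle individual agent rows, regardless of
        # which env step they came from.
        shuffled = list(rows)
        random.shuffle(shuffled)
        for i in range(0, len(shuffled), minibatch_size):
            yield shuffled[i:i + minibatch_size]
    else:  # "lockstep": keep all agent rows from the same env step together
        by_env_step = defaultdict(list)
        for row in rows:
            by_env_step[row["env_step_id"]].append(row)
        env_steps = list(by_env_step.values())
        random.shuffle(env_steps)
        for i in range(0, len(env_steps), minibatch_size):
            # Flatten the selected env steps back into agent rows.
            yield [r for step in env_steps[i:i + minibatch_size] for r in step]
```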

Thanks!
Thomas

Note: I tested with Ray 1.0.1, but looking at the files, the behavior doesn’t seem any different on the master branch.

Hey @thomaslecat, interesting find. You are right, this only affects sampling from a buffer.
On the other hand, for PPO, wouldn’t the train batch be “locked” anyway, b/c it’s always coming directly from the env rollouts? So you would always have the “natural” agent-ratios in that train batch, no?

Hi @sven1977, thanks for your reply! Indeed, the train batch comes directly from the env rollouts, so it’s “locked”, but it is then divided into smaller minibatches randomly, so these minibatches aren’t “locked” anymore.

For context: I am trying to implement differentiable communication channels between agents using Ray’s built-in agents and custom models, hence my questions and tickets around the “lockstep” mode :grinning_face_with_smiling_eyes:

If we consider PPO on an env with 3 agents, train_batch_size: 30, and sgd_minibatch_size: 5, we could imagine the following behaviors:

  • In replay_mode: independent (current behavior): we sample 10 env steps, get a train batch of 30 agent steps, then each minibatch is made of 5 randomly sampled agent steps.
  • In replay_mode: lockstep: we sample 30 env steps, get a train batch of 90 agent steps, then each minibatch is made of 5 randomly sampled env steps totaling 15 agent steps (see the toy run after this list).
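
To make the bookkeeping concrete, here is a toy run of the sketch from my first post (again with hypothetical keys, just to check the counts):

```python
# "independent" accounting: 10 env steps x 3 agents = 30 agent rows.
rows_independent = [{"env_step_id": t, "agent_id": a}
                    for t in range(10) for a in range(3)]
# "lockstep" accounting: 30 env steps x 3 agents = 90 agent rows.
rows_lockstep = [{"env_step_id": t, "agent_id": a}
                 for t in range(30) for a in range(3)]

mb_ind = next(minibatches_sketch(rows_independent, 5, "independent"))
mb_lock = next(minibatches_sketch(rows_lockstep, 5, "lockstep"))
print(len(mb_ind))   # 5 agent steps, possibly from different env steps
print(len(mb_lock))  # 15 agent steps: 5 whole env steps x 3 agents
```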