How are minibatches split?

Hi all,

I am currently using a policy client/server setup with 4 remote workers collecting samples and a single policy server that does the training.

I am using the built-in FCNet with LSTM wrapping + PPO and "batch_mode": "complete_episodes". My question is as follows:

Given the following hypothetical:
If my LSTM seq_len is 16, my minibatch is 16, my overall buffer is 256, and my episodes are all 64 timesteps long (4 workers, so 4 episodes in an overall training cycle):

  1. When doing the minibatches, will RLlib automatically grab 16 (in-order) timesteps from a SINGLE episode to do an SGD pass on? Or will the minibatch consist of 16 RANDOM timesteps from any combination of the episodes?

  2. If it is from a single episode, are the 16 timesteps in order, to ensure a logical time sequence when learning, or will it be any 16 timesteps from that episode?
    2.5) If the 16 are in order, will the batches of 16 timesteps from the episode be in order relative to each other as well, or is the order of the batches random? I.e. 0-15, 16-31 OR 0-15, 32-47, 16-31, etc.

Thanks in advance!

Hi @Denys_Ashikhin ,

I asked a question a while back where, in his answer, @mannyv also explained a little about the minibatch setup. Maybe his answer helps you understand better what is going on under the hood.

Hope this helps

Hi @Lars_Simon_Zehnder ,
It was a good start, however, unless I misunderstood (high possibility of that), nothing really pertained to how LSTMs affect the batching for minibatches and the exact ordering of them…

Hi @Denys_Ashikhin,

What is your rollout_fragment_length?

  1. Yes, you will get contiguous time sequences of length 16 or shorter, but with your configuration there should not be any shorter.

  2. The minibatches are randomly sampled from the full train_batch. Since your max_seq_len and sgd_minibatch_size match, you will only perform SGD on one sequence per minibatch. How many num_sgd_iter are you doing? The order of the 16-timestep-long subsequences will be randomly selected on each SGD iteration.

Think about it like this. Take your 256 timesteps that are ordered by time and episode and divide them into 16 groups of length 16. Within each group there will be consecutively ordered timesteps from the same episode.

During 1 iteration of training, with your settings you will train with each group as a minibatch.

You will do this num_sgd_iter times.
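In code terms, the idea looks roughly like this (a minimal sketch using the hypothetical numbers above, not RLlib's actual implementation):

    import numpy as np

    train_batch_size = 256   # total timesteps per training cycle
    max_seq_len = 16         # LSTM sequence length
    num_sgd_iter = 2         # epochs over the train batch

    # Timesteps arrive ordered by episode and time (4 episodes x 64 steps).
    timesteps = np.arange(train_batch_size)

    # Chop the batch into contiguous groups of max_seq_len timesteps;
    # each group stays in time order internally.
    groups = timesteps.reshape(-1, max_seq_len)  # shape (16, 16)

    for epoch in range(num_sgd_iter):
        # The *order of the groups* is shuffled each epoch; the timesteps
        # inside each group are not.
        for g in np.random.permutation(len(groups)):
            minibatch = groups[g]  # one contiguous 16-step sequence
            # ... one SGD step on `minibatch` would happen here ...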

That was just a hypothetical situation, with numbers that divide nicely.

My rollout_fragment_length doesn’t matter since I have complete_episodes, so it rolls out one episode at a time → this allows for backpropagation of the LSTM state across the entire episode and not just rollout fragments, as per: 'rollout_fragment_length' and 'truncate_episodes' · Issue #10179 · ray-project/ray · GitHub

My actual setup (that I am testing right now):

    train_batch_size: 7680,
    sgd_minibatch_size: 64,
    num_sgd_iter: 10,
    lstm max_seq_len: 16

What I’m hoping is that it takes in-order (from a single episode) sequences of 16 samples for the minibatch; if the minibatch is from one episode, then all 4 sequences (of 16) are in order, or if it spans different episodes, then each sequence still retains its order across iterations.

However, you mentioned that the batches of 16 could be rather random - how does the LSTM then learn the order of sequences as a game goes on across an entire episode?

P.S.
The average episode length is ~700 steps for me

Also, does this mean that in my simple example, with num_sgd_iter=2, I would only use 32 samples before going on to the next epoch (which involves waiting for another 256 timesteps - leading to 224 timesteps of wastage)?

No, that is not what I meant. You would go through each minibatch twice.

This is the config I have (copied from slightly older RLlib docs):

    # Number of timesteps collected for each SGD round. This defines the size
    # of each SGD epoch.
    "train_batch_size": 7680,
    # Total SGD batch size across all devices for SGD. This defines the
    # minibatch size within each epoch.
    "sgd_minibatch_size": 64,
    # Number of SGD iterations in each outer loop (i.e., number of epochs to
    # execute per train batch).
    "num_sgd_iter": 10,
    # Whether to shuffle sequences in the batch when training (recommended).
    "shuffle_sequences": False,

I had read that as num_sgd_iter: 10 meaning I will go over each datapoint 10 times (so 7680/64 = 120 inner iterations, then 10 outer ones) → is that not correct then? Instead it is 10 * 64 in MY case, so I would only use 640 out of 7680 samples, and the rest would be discarded…?

@Denys_Ashikhin,

No, I was wrong about that; your original understanding is correct. Sorry about that.

I updated the wording of my previous post to be more accurate.
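To spell out the arithmetic of that (correct) reading, using just the numbers from your config:

    # Plain Python, just the arithmetic of the config above (no RLlib).
    train_batch_size = 7680
    sgd_minibatch_size = 64
    num_sgd_iter = 10

    minibatches_per_epoch = train_batch_size // sgd_minibatch_size  # 120
    total_sgd_steps = minibatches_per_epoch * num_sgd_iter          # 1200

    # Every sample is visited once per epoch, i.e. 10 times in total;
    # nothing is discarded.
    print(minibatches_per_epoch, total_sgd_steps)  # -> 120 1200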

Okay good! Phew lol.

And also, is shuffle_sequences = False going to mean that the batches of 16-step sequences will be in order for each episode?

Is there a reason why we don’t want the LSTM sequences in order across the episode?

The code is here if you want to look at it.

Past the max_seq_len boundary, the order does not matter. The LSTM will only form a memory over that many timesteps, so whether sequences are in order across the max_seq_len boundary or not does not really matter.

It looks like in the code that, if you are using an LSTM, it will shuffle the order of groups in the minibatch and make sure that the data and the starting state are shuffled correctly together.
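Roughly, that pairing looks like this (an illustrative numpy sketch; the array names and shapes are made up for the example, not RLlib's):

    import numpy as np

    # 8 sequences of 16 timesteps each, plus one initial LSTM state per sequence.
    num_seqs, max_seq_len, obs_dim, cell_size = 8, 16, 4, 32
    seqs = np.random.randn(num_seqs, max_seq_len, obs_dim)  # [B, T, obs]
    state_in = np.random.randn(num_seqs, cell_size)         # one row per sequence

    # Applying the same permutation to both arrays keeps each sequence
    # paired with the LSTM state it actually started from during the rollout.
    perm = np.random.permutation(num_seqs)
    seqs, state_in = seqs[perm], state_in[perm]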

I realise this is getting sidetracked, but then my question is: does the LSTM cell_state not get carried through the entire episode? Is there a way to make it go all the way?

You can set max_seq_len to be longer than your largest possible episode length. But if that length is too long, then you have to worry about vanishing gradients in BPTT.
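In the (older-style) config dict that would look something like this; the 800 is a hypothetical value chosen to exceed your ~700-step episodes:

    config = {
        "batch_mode": "complete_episodes",
        "model": {
            "use_lstm": True,
            # Hypothetical value: longer than the longest episode (~700 steps),
            # so a whole episode fits in one sequence. Very long sequences
            # risk vanishing gradients in BPTT.
            "max_seq_len": 800,
        },
    }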

Got it, so to summarise:

Minibatches will have sequences in order, but the ordering of minibatches will be random.

The LSTM cell_state does not get carried across minibatches during training; instead it is limited to max_seq_len (after which the internal state is reset?)


That is correct.

One detail: when we say reset, we do not mean reset to the initial state. It will be whatever the state was during rollouts, so it does not lose that knowledge of the state. What is reset (truncated being the more accurate term) is the gradients flowing from previous timesteps.
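In generic PyTorch terms (not RLlib's code), the truncation looks like this: the state values carry across chunk boundaries, while detach() only cuts the gradient graph:

    import torch
    import torch.nn as nn

    lstm = nn.LSTM(input_size=4, hidden_size=8, batch_first=True)
    x = torch.randn(1, 64, 4)            # one 64-step episode
    h = torch.zeros(1, 1, 8)             # initial hidden state
    c = torch.zeros(1, 1, 8)             # initial cell state
    max_seq_len = 16

    for t in range(0, 64, max_seq_len):
        out, (h, c) = lstm(x[:, t:t + max_seq_len], (h, c))
        loss = out.pow(2).mean()         # stand-in loss for illustration
        loss.backward()                  # gradients flow at most 16 steps back
        # The state *values* are kept, so no knowledge is lost; only the
        # gradient graph is cut at the chunk boundary.
        h, c = h.detach(), c.detach()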

Riiight, that makes sense: for training, the inner cell_state is based on what it was at the moment it was sampled - the training epoch does not modify that value - hence order doesn’t matter too much beyond those gradients.
Okay, it’s all clicking together.

Last thing (promise)
With GPU-enabled computation, the minibatches are stored on the GPU, but is the overall train_batch stored in system RAM or on the GPU?
(I also have LZ4 compression enabled)