I am currently using a policy/server setup with about 4 remote workers collecting samples, and a single policy server that does the training.
I am using the built-in FCNet with LSTM wrapping + PPO and "batch_mode": "complete_episodes". My question is as follows:
Given the following hypothetical:
If my LSTM max_seq_len is 16, my minibatch is 16, and my overall buffer is 256, and my episodes are all 64 timesteps long (4 workers, so 4 episodes per overall training cycle):
1) When doing the minibatches, will RLlib automatically grab 16 (in-order) timesteps from a SINGLE episode to do an SGD pass on? Or will the minibatch consist of 16 RANDOM timesteps from any combination of the episodes?
2) If it is from a single episode, are the 16 timesteps in order, to ensure a logical time sequence when learning, or will it be any 16 timesteps from a single episode?
2.5) If the 16 are in order, will the successive batches of 16 timesteps from the episode be in order as well, or is the order of the batches random? I.e., 0-15, 16-31, ... OR 0-15, 32-47, 16-31, etc.?
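For reference, a minimal sketch of the config this hypothetical assumes (the keys are from RLlib's PPO config; the values are just the hypothetical numbers above):

```python
config = {
    "num_workers": 4,                  # 4 remote rollout workers
    "batch_mode": "complete_episodes",
    "train_batch_size": 256,           # the overall buffer
    "sgd_minibatch_size": 16,          # the minibatch
    "model": {
        "use_lstm": True,              # FCNet with LSTM wrapping
        "max_seq_len": 16,             # LSTM sequence length
    },
}
```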
I had a question where, in his answer, @mannyv also explained a little about the minibatch setup. Maybe his answer helps you understand better what is going on under the hood.
Hi @Lars_Simon_Zehnder ,
It was a good start; however, unless I misunderstood (high possibility of that), nothing really pertained to how LSTMs affect the batching for minibatches and the exact sequencing of them…
1) Yes, you will get contiguous time sequences of length 16 or shorter, but with your configuration there should not be any shorter.
2) The minibatches are randomly sampled from the full train batch. Since your max_seq_len and sgd_minibatch_size match, you will only perform SGD on one sequence per minibatch. How many num_sgd_iter are you doing? The order of the 16-timestep subsequences will be randomly selected on each SGD iteration.
Think about it like this: take your 256 timesteps, which are ordered by time and episode, and divide them into 16 groups of length 16. Within each group there will be consecutively ordered timesteps from the same episode.
During one iteration of training, with your settings, you will train with each group as a minibatch.
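Roughly like this (a toy sketch in plain NumPy, not RLlib's actual code):

```python
import numpy as np

# 256 timesteps, time-ordered within each of the 4 episodes of 64 steps.
train_batch = np.arange(256)

# Chop into 16 sequences of max_seq_len = 16. Episodes are 64 steps,
# so each splits cleanly into 4 sequences of consecutive timesteps.
sequences = train_batch.reshape(16, 16)

# Each SGD epoch visits the sequences in a fresh random order; with
# sgd_minibatch_size == max_seq_len, each minibatch is one sequence.
for idx in np.random.permutation(len(sequences)):
    minibatch = sequences[idx]  # 16 consecutive timesteps, in order
```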
My actual setup (that I am testing right now; quick arithmetic sketch after the list):
train_batch_size: 7680,
sgd_minibatch_size: 64,
num_sgd_iter: 10,
lstm max_seq_len: 16
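For concreteness, assuming minibatches are built from whole max_seq_len sequences as described above, the arithmetic for these settings works out to:

```python
train_batch_size = 7680
sgd_minibatch_size = 64
num_sgd_iter = 10
max_seq_len = 16

seqs_per_minibatch = sgd_minibatch_size // max_seq_len          # 4 sequences
minibatches_per_epoch = train_batch_size // sgd_minibatch_size  # 120
total_sgd_steps = minibatches_per_epoch * num_sgd_iter          # 1200
```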
What I'm hoping is that it takes an in-order sequence of 16 samples from a single episode, and that the 4 sequences inside each 64-sample minibatch are in order as well, or, if they come from different episodes, that order is still retained across iterations.
However, you mentioned that the order of the 16-step sequences could be rather random: how does the LSTM then learn the order of sequences as a game goes on across an entire episode?
P.S.
The average episode length is ~700 steps for me
Also, does this mean that in my simple example, with num_sgd_iter=2, I would only use 32 samples before going on to the next training batch (which involves waiting for another 256 timesteps, leading to 224 timesteps of wastage)???
This is the code I have (copied from slightly older RLlib docs):
# Number of timesteps collected for each SGD round. This defines the size
# of each SGD epoch.
"train_batch_size": 7680,
# Total SGD batch size across all devices for SGD. This defines the
# minibatch size within each epoch.
"sgd_minibatch_size": 64,
# Number of SGD iterations in each outer loop (i.e., number of epochs to
# execute per train batch).
"num_sgd_iter": 10,
# Whether to shuffle sequences in the batch when training (recommended).
"shuffle_sequences": False,
I had read that as num_sgd_iter: 10 meaning I will go over each datapoint 10 times (so 7680/64 = 120 inner iterations, and then 10 outer ones) → is this not correct then? Instead it is 10 * 64 in MY case, so I would only use 640 out of 7680 samples?? And the rest would be discarded…?
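To spell out my original reading (pseudo-Python; iter_minibatches and sgd_step are just placeholders, not RLlib functions):

```python
# One training iteration over a collected train batch of 7680 timesteps:
for epoch in range(num_sgd_iter):                       # 10 outer epochs
    # Each epoch walks the WHOLE train batch in minibatches,
    # so no samples would be discarded.
    for minibatch in iter_minibatches(train_batch, 64):  # 120 inner steps
        sgd_step(minibatch)
```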
For boundaries past max_seq_len, the order does not matter. The LSTM will only form a memory over that many timesteps, so beyond max_seq_len it does not really matter whether the sequences are in order or not.
It looks like, in the code, if you are using an LSTM it will shuffle the order of the sequences in the minibatch and make sure that the data and the starting states are shuffled together correctly.
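Something like this toy sketch (hypothetical shapes, not RLlib's actual implementation):

```python
import numpy as np

# 480 sequences of 16 steps each, plus one initial LSTM state
# (h, c) recorded during rollouts for every sequence.
obs_seqs = np.zeros((480, 16, 84))     # [num_seqs, max_seq_len, obs_dim]
init_states = np.zeros((480, 2, 256))  # [num_seqs, (h, c), cell_size]

# Shuffle the sequences while keeping each one paired with its
# own starting state.
perm = np.random.permutation(len(obs_seqs))
obs_seqs, init_states = obs_seqs[perm], init_states[perm]
```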
I realise this is getting sidetracked, but then my question is: the LSTM cell_state does not get carried through the entire episode? Is there a way to make it go all the way?
You can set max_seq_len to be longer than your largest possible episode length. But if that length is too long, then you have to worry about vanishing gradients in BPTT.
One detail: when we say reset, we do not mean reset to the initial state. It will be whatever the state was during rollouts, so it does not lose that knowledge of the state. What is reset (truncated is the more accurate term) is the gradients flowing from previous timesteps.
Riiight, that makes sense: for training, the inner cell_state is based on what it was at the moment it was sampled, and the training epoch does not modify that value, hence order doesn't matter too much beyond those gradients.
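So, to double-check my understanding, truncated BPTT in PyTorch terms would look roughly like this (a toy example, not RLlib code):

```python
import torch

# A toy episode of 64 steps, split into 4 chunks of max_seq_len = 16.
episode = torch.randn(1, 64, 8)  # [batch, time, features]
lstm = torch.nn.LSTM(input_size=8, hidden_size=32, batch_first=True)
state = (torch.zeros(1, 1, 32), torch.zeros(1, 1, 32))

for chunk in episode.split(16, dim=1):
    out, state = lstm(chunk, state)
    out.sum().backward()  # stand-in for a real loss
    # Carry the state's VALUE into the next chunk, but detach it so
    # gradients stop flowing past the chunk boundary (the "reset").
    state = tuple(s.detach() for s in state)
```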
Okay, it's all clicking together.
Last thing (promise)
With GPU-enabled computation, the minibatches are stored on the GPU, but is the overall train batch kept in RAM rather than GPU memory?
(I also have LZ4 compression enabled.)