I am currently using a policy/server setup with about 4 remote workers collecting samples, and a single policy server that does the training.
I am using the built-in FCNet with LSTM wrapping + PPO, and
`"batch_mode": "complete_episodes"`. My question is as follows:
Given the following hypothetical:
If my LSTM seq_len is 16, my minibatch size is 16, and my overall buffer is 256, and my episodes are all exactly 64 timesteps long (4 workers, so 4 episodes per training cycle):
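For concreteness, the hypothetical above would correspond to a config roughly like this (key names follow RLlib's classic dict-style PPO config; they may differ on newer API versions):

```python
# Sketch of the hypothetical setup above, using RLlib's classic dict-style
# PPO config keys (key names may vary across RLlib versions).
config = {
    "num_workers": 4,                   # 4 remote rollout workers
    "batch_mode": "complete_episodes",  # each sample batch holds whole episodes
    "train_batch_size": 256,            # overall buffer: 4 workers x 64-step episodes
    "sgd_minibatch_size": 16,           # 16 timesteps per SGD pass
    "model": {
        "use_lstm": True,               # wrap the default FCNet with an LSTM
        "max_seq_len": 16,              # LSTM sequence length
    },
}
```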
1) When building the minibatches, will RLlib automatically grab 16 consecutive timesteps from a SINGLE episode for each SGD pass, or will a minibatch consist of 16 RANDOM timesteps drawn from any combination of the episodes?
2) If it is from a single episode, are the 16 timesteps in order (to preserve the logical time sequence when learning), or can they be any 16 timesteps from that episode?
2.5) If the 16 timesteps are in order, will the chunks of 16 within an episode be processed in order as well, or is the order of the chunks random? I.e., 0-15, 16-31, ... vs. 0-15, 32-47, 16-31, etc.
Thanks in advance!