How does "rollout_fragment_length" in the specification for the trainer interact with "max_seq_len" in the specification for the model?

max_seq_len specifies the maximum sequence length the model receives during training. rollout_fragment_length specifies for how many time steps an episode is rolled out. I wonder how the two settings interact in the following cases.

(1) max_seq_len > rollout_fragment_length
(2) max_seq_len < rollout_fragment_length
(3) max_seq_len == rollout_fragment_length

I have already found that the model receives at most max_seq_len time steps of a trajectory when rollout_fragment_length > max_seq_len. But what happens to the rest of the trajectory? Are the trajectories cut into pieces so that the model always gets at most max_seq_len time steps? If so, where is this implemented?
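For reference, here is a minimal sketch of where the two settings sit, assuming the dict-style RLlib config. The values are only illustrative and correspond to case (2); newer RLlib versions may expose these settings differently.

```python
# Illustrative values only (assumption): this corresponds to case (2),
# max_seq_len < rollout_fragment_length.
config = {
    # Trainer-level setting: steps a rollout worker gathers per fragment.
    "rollout_fragment_length": 200,
    # Model-level settings live in the nested "model" dict.
    "model": {
        "use_lstm": True,
        "max_seq_len": 20,
    },
}
```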

Afaik, max_seq_len chunks the batch into pieces for the backward pass. After each chunk, the state is exported for the next backward pass. So it really only affects memory usage and computational speed. Sven may know better.

Hi Lukas, hi smorad, I have a slightly different understanding of the two parameters.
The rollout_fragment_length describes how many steps a rollout worker has to take before its experiences can be collected.

Two examples:

  • Let’s say a rollout worker has zero experiences collected, i.e. it is just starting to act. If it completes an episode of m>n steps, where n is the rollout_fragment_length, it will chop this episode into pieces of length n, and these will be collected by the training algorithm. It then keeps the leftover experiences, adds them to its future experiences, and repeats the process.
  • In another scenario, the rollout worker also starts with zero experiences, but its first episode is shorter than the rollout_fragment_length. It must therefore collect more experiences before it can send a single fragment, which then consists of experiences from multiple episodes.
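The two examples above can be sketched in plain Python. This is not RLlib's sampler, only a toy model that assumes fragments are cut after a fixed number of steps regardless of episode boundaries; `collect_fragments` and the `(episode_id, t)` step format are made up for illustration.

```python
from typing import Iterable, List, Tuple

Step = Tuple[int, int]  # (episode_id, timestep within that episode)


def collect_fragments(episode_lengths: Iterable[int],
                      fragment_length: int) -> List[List[Step]]:
    """Toy rollout worker: emit a fragment every `fragment_length` steps."""
    fragments: List[List[Step]] = []
    buffer: List[Step] = []
    for episode_id, length in enumerate(episode_lengths):
        for t in range(length):
            buffer.append((episode_id, t))
            if len(buffer) == fragment_length:
                fragments.append(buffer)
                buffer = []
    # Steps still in `buffer` stay with the worker for the next fragment.
    return fragments


# First example: one episode of m=5 steps, n=2.
fragments_long = collect_fragments([5], fragment_length=2)
# Second example: episodes shorter than n=4, so a fragment spans episodes.
fragments_short = collect_fragments([2, 2, 3], fragment_length=4)
```

In the second call, the single emitted fragment contains steps from episodes 0 and 1, and each step keeps its episode id so the sequences can be separated again later.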

RLlib keeps track of which experiences belong to which episode for you. So later on we can avoid feeding recurrent models experiences from different episodes if we do not want to do that.

max_seq_len is predominantly used in ray/rnn_sequencing.py (ray-project/ray on GitHub, master branch) to chop the batch into sequences of at most max_seq_len timesteps. You need this especially for recurrent models. As far as I know, it can be ignored if your policy is stateless.

(1) Since max_seq_len is larger than the largest fragment you have collected, RLlib will use the largest sequence length it can find in your batch, possibly rollout_fragment_length itself if your episodes are long enough. Or, if you are “lucky”, you have a sequence that stems from two rollout fragments and can use your maximum sequence length.
(2) RLlib looks for the largest sequence in the batch and chops the batch into pieces of that size.
(3) Again, if your episodes are long enough, your sequences will be of size rollout_fragment_length. Otherwise they will be smaller.

The trajectories are cut into pieces and the recurrent model (or attention model?) receives sequences of at most max_seq_len timesteps. But as far as I understand it, nothing goes to waste. Smaller chunks are just padded if they need to be.
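To make the chopping and padding concrete, here is a rough sketch in plain Python. It is not the code from rnn_sequencing.py; `chop_and_pad` is a hypothetical helper that assumes sequences never cross episode boundaries and that shorter sequences are zero-padded to the longest sequence in the batch.

```python
from typing import List, Tuple


def chop_and_pad(values: List[float], episode_ids: List[int],
                 max_seq_len: int) -> Tuple[List[List[float]], List[int]]:
    """Split a flat batch into sequences of at most `max_seq_len` steps."""
    sequences: List[List[float]] = []
    start = 0
    n = len(values)
    for i in range(1, n + 1):
        # Close the current sequence at the end of the batch, at an
        # episode boundary, or once it has reached max_seq_len steps.
        if (i == n or episode_ids[i] != episode_ids[start]
                or i - start == max_seq_len):
            sequences.append(values[start:i])
            start = i
    seq_lens = [len(seq) for seq in sequences]
    padded_len = max(seq_lens)
    # Zero-pad on the right so all rows have equal length.
    padded = [seq + [0.0] * (padded_len - len(seq)) for seq in sequences]
    return padded, seq_lens


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
episode_ids = [0, 0, 0, 0, 0, 1, 1]
padded, seq_lens = chop_and_pad(values, episode_ids, max_seq_len=3)
# padded   -> [[1.0, 2.0, 3.0], [4.0, 5.0, 0.0], [6.0, 7.0, 0.0]]
# seq_lens -> [3, 2, 2]
```

Every timestep ends up in exactly one sequence, so nothing is dropped. The sequence lengths are kept so the padding can be told apart from real data.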

@smorad So, as I understand it, this affects memory usage. But might it also affect the performance of the model because of the padding?

You’re right, I didn’t think about the padding but it could certainly affect the performance for certain cases.

By performance, do we mean speed or gradient computations? If it is speed, then I agree that padding will make it slower. If it is gradients, that should be fine, because the padded loss values are either zeroed out or removed before the final accumulation operation.
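A small sketch of what "zeroed out or removed" means here, in plain Python. `masked_mean_loss` is a made-up helper, not an RLlib function; it assumes the per-timestep losses are stored as padded rows together with the true sequence lengths.

```python
from typing import List


def masked_mean_loss(per_step_loss: List[List[float]],
                     seq_lens: List[int]) -> float:
    """Average the loss over real timesteps only, skipping padded slots."""
    total = 0.0
    count = 0
    for row, seq_len in zip(per_step_loss, seq_lens):
        total += sum(row[:seq_len])  # slots beyond seq_len are padding
        count += seq_len
    return total / count


# Second row has one real step; the 100.0 entries sit in padded slots.
per_step_loss = [[1.0, 2.0, 3.0], [4.0, 100.0, 100.0]]
loss = masked_mean_loss(per_step_loss, seq_lens=[3, 1])
# loss -> 2.5, regardless of the values in the padded slots
```

The padded slots still cost compute in the forward pass, which is where the slowdown comes from, but they do not contribute to the averaged loss.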

One additional thing to add to this is the train_batch_size, which is related.

The param train_batch_size specifies how many timesteps are passed to each training iteration.
However, if rollout_fragment_length (200 by default) is larger than train_batch_size (let’s say 20), then RLlib will still collect those 200 timesteps (but only use 20 for the training?).

Probably this is not an issue generally, but things didn’t add up for me when I tried to have some quick debugging runs with low train_batch_size.