[RLlib] Ray RLlib config parameters for PPO

PPO parameters

```python
{
    # Size of batches collected from each worker.
    "rollout_fragment_length": 200,
    # Number of timesteps collected for each SGD round. This defines the size
    # of each SGD epoch.
    "train_batch_size": 4000,
    # Total SGD batch size across all devices for SGD. This defines the
    # minibatch size within each epoch.
    "sgd_minibatch_size": 128,
    # Whether to shuffle sequences in the batch when training (recommended).
    "shuffle_sequences": True,
    # Number of SGD iterations in each outer loop (i.e., number of epochs to
    # execute per train batch).
    "num_sgd_iter": 30,
}
```

Hi, I have a question about how PPO updates the neural network.

I reckon that “train_batch_size” / “rollout_fragment_length” is the number of fragments, and that “rollout_fragment_length” corresponds to the lambda of TD(lambda). Right?

Is the neural network updated once per “train_batch_size” timesteps? For example, as in the figure, if the train batch size is 1000, is the update period 1000 timesteps? Right?

Finally, which data are used to update the neural network? As in the figure, are minibatches of size “sgd_minibatch_size” extracted from the train batch and used for the updates? And what exactly does “num_sgd_iter” mean?



Hey @Xim_Lee,
check out this documentation page here, where we explain all these config keys in more detail.

On the PPO-specific keys:
sgd_minibatch_size: PPO takes a train batch (of size train_batch_size) and splits it into n pieces of size sgd_minibatch_size. E.g. if train_batch_size=1000 and sgd_minibatch_size=100, we create 10 “sub-sampling” pieces out of the train batch.
num_sgd_iter: The above sub-sampling pieces are then fed to the NN num_sgd_iter times for updating. So in the above example, if num_sgd_iter=30, we do 30 x 10 = 300 updates altogether on one single train batch.
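The counting above can be sketched in plain Python. This is a minimal, hypothetical illustration of the loop structure described in the answer, not RLlib's actual implementation: `minibatch_sgd` and `update_fn` are made-up names (`update_fn` stands in for one gradient step on a minibatch), and shuffling is left out.

```python
def minibatch_sgd(train_batch, sgd_minibatch_size, num_sgd_iter, update_fn):
    """Run num_sgd_iter passes over train_batch in sgd_minibatch_size pieces.

    Hypothetical sketch: update_fn stands in for one gradient step.
    Returns the total number of gradient steps taken.
    """
    num_updates = 0
    for _ in range(num_sgd_iter):  # one pass (epoch) over the whole train batch
        for start in range(0, len(train_batch), sgd_minibatch_size):
            minibatch = train_batch[start:start + sgd_minibatch_size]
            update_fn(minibatch)
            num_updates += 1
    return num_updates


# Example from the answer: 1000 timesteps, minibatches of 100, 30 iterations.
train_batch = list(range(1000))  # stand-in for 1000 collected timesteps
seen_sizes = []
total = minibatch_sgd(train_batch, 100, 30, lambda mb: seen_sizes.append(len(mb)))
print(total)  # 300
```

Every timestep in the train batch is therefore reused num_sgd_iter times, once per pass.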


Thanks as always, @sven1977!

I have one more question.

When I use the PPO algorithm without setting up any recurrent model support such as an RNN or LSTM, is one applied automatically? (Of course, I do set the fcnet parameters.)

If you would like to wrap your model (your custom one or RLlib’s default one) with a) an LSTM or b) an attention net, you can simply set “use_lstm” or “use_attention” to True in the “model” sub-dict of your config.
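As a minimal sketch, assuming the dict-style config used in this thread (Ray 1.x), the relevant part of a PPO config might look like this. `use_lstm`, `max_seq_len`, `lstm_cell_size`, `fcnet_hiddens` and `fcnet_activation` are keys listed in `MODEL_DEFAULTS` in `ray/rllib/models/catalog.py`; the values are illustrative only, and `CartPole-v0` is just a placeholder environment.

```python
# Sketch of a PPO config dict that enables RLlib's automatic LSTM wrapping.
# Values are illustrative; check ray/rllib/models/catalog.py for your Ray version.
config = {
    "env": "CartPole-v0",  # placeholder environment
    "train_batch_size": 4000,
    "sgd_minibatch_size": 128,
    "num_sgd_iter": 30,
    "model": {
        "fcnet_hiddens": [256, 256],  # the default fully connected model ("A")
        "fcnet_activation": "tanh",
        "use_lstm": True,             # wrap A with a single LSTM layer
        "max_seq_len": 20,            # length of the time chunks used for training
        "lstm_cell_size": 256,
        # "use_attention": True,      # alternative: wrap with an attention net instead
    },
}
```

Without `"use_lstm": True` (or `"use_attention": True`) in the model sub-dict, no recurrent wrapper is added. In Ray 1.x the dict would be passed to the trainer, e.g. `PPOTrainer(config=config)` from `ray.rllib.agents.ppo`; newer Ray versions use a different config API.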

Check out ray/rllib/models/catalog.py for more information on supported model config keys.


Hi sven1977,

Thanks for your explanations. What exactly do you mean by “wrap your model with xxx”? I would like to append an MLP after my LSTM module. How can I achieve that with PPO?


Hey @John, adding an MLP after(!) the LSTM would require you to write an entire custom model (including the LSTM layer).

Auto-wrapping with LSTM/attention works like this:
A = the “normal” model (either a default one picked by RLlib or a custom one). A does not include an LSTM.
Setting “use_lstm=True” adds a single(!) LSTM layer after A: the output of A is fed into the LSTM (along with the LSTM’s states), and the LSTM’s output then produces the logits and value outputs.

If you need to add more MLP layers after the LSTM, you should not use “use_lstm”, but simply define an entire monolithic custom model, which already includes the LSTM. Examples are here:

ray/rllib/examples/models/rnn_model.py (<-- has LSTM custom model examples for both tf and torch).
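A rough sketch of the layer layout such a monolithic model would have, written as a plain PyTorch `nn.Module` rather than against RLlib's `ModelV2`/`RecurrentNetwork` interface (wiring it into RLlib should follow `rnn_model.py`). The class name `LSTMThenMLP` and all layer sizes are hypothetical; the point is only the order: a dense layer, the LSTM, extra MLP layers, then separate logits and value heads.

```python
import torch
import torch.nn as nn


class LSTMThenMLP(nn.Module):
    """Hypothetical sketch: obs -> dense -> LSTM -> extra MLP -> logits / value."""

    def __init__(self, obs_dim, num_outputs, lstm_size=64, mlp_hiddens=(64, 64)):
        super().__init__()
        self.pre_lstm = nn.Linear(obs_dim, lstm_size)
        self.lstm = nn.LSTM(lstm_size, lstm_size, batch_first=True)
        # The MLP layers placed *after* the LSTM (what use_lstm cannot provide).
        layers, in_size = [], lstm_size
        for hidden in mlp_hiddens:
            layers += [nn.Linear(in_size, hidden), nn.ReLU()]
            in_size = hidden
        self.post_lstm_mlp = nn.Sequential(*layers)
        self.logits_head = nn.Linear(in_size, num_outputs)
        self.value_head = nn.Linear(in_size, 1)

    def forward(self, obs, state=None):
        # obs: [B, T, obs_dim]; state: optional (h, c), each [1, B, lstm_size]
        x = torch.relu(self.pre_lstm(obs))
        x, (h, c) = self.lstm(x, state)  # state=None means zero initial state
        x = self.post_lstm_mlp(x)
        logits = self.logits_head(x)             # [B, T, num_outputs]
        value = self.value_head(x).squeeze(-1)   # [B, T]
        return logits, value, (h, c)


model = LSTMThenMLP(obs_dim=4, num_outputs=2)
logits, value, (h, c) = model(torch.zeros(50, 20, 4))  # B=50 sequences, T=20 steps
```

Keeping separate logits and value heads on top of the shared MLP mirrors the logits+value outputs mentioned above; whether to share those layers between policy and value is a design choice of the custom model.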


@sven1977 How does this work with an RNN or LSTM? For example, if max_seq_len=20, a train batch of size 1000 will be broken down into 50 chunks of 20 steps, so the “effective batch size” would be 50.
With sgd_minibatch_size=100, does this mean that there are 10 “sub-sampling” pieces, each consisting of 5 chunks of 20 steps?

Additionally, what happens to the rest of a train batch if train_batch_size isn’t a multiple of sgd_minibatch_size (e.g. 1000 / 128 => 7 “sub-sampling” pieces of size 128 and 1 of size 104)?


“I mean e.g. if I suppose max_seq_len=20, then a train batch of size 1000 will be broken down into 50 chunks of 20 steps, so “effective batch size” would be 50.”
Yes, that’s correct: B=50, T=20 in the above case. Note, however, that for attention nets (not for LSTMs), the memory “trail” can still go back further in time (e.g. if attention_memory_training=100, the memory trail reaches back 100 steps before the start of this chunk).
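The B/T arithmetic can be illustrated with NumPy. This sketch assumes the 1000 timesteps split evenly into 20-step chunks with no episode boundaries in between; RLlib itself also has to handle sequences cut short at episode ends (which it pads to max_seq_len), and that is not shown here. `obs_dim=4` is an arbitrary placeholder.

```python
import numpy as np

train_batch_size, max_seq_len, obs_dim = 1000, 20, 4

# Flat train batch: one row of observations per timestep.
flat_obs = np.arange(train_batch_size * obs_dim, dtype=np.float32).reshape(
    train_batch_size, obs_dim
)

# Chop into consecutive max_seq_len chunks -> [B, T, obs_dim].
B = train_batch_size // max_seq_len
seq_obs = flat_obs.reshape(B, max_seq_len, obs_dim)

print(seq_obs.shape)  # (50, 20, 4)
```

Row-major reshaping keeps consecutive timesteps together, so chunk `b` holds timesteps `20*b` to `20*b + 19`.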

“For a sgd_minibatch_size=100, does this mean that there are 10 “sub-sampling” pieces consisting of 5 chunks with 20 steps each?”
Yes, also correct. We run a loop of n iterations (n=num_sgd_iter); in each iteration we take the train batch and chop it into sgd_minibatch_size pieces, which in turn get chopped into max_seq_len time chunks. If train_batch_size is not a multiple of sgd_minibatch_size, the last minibatch will be smaller, which I don’t think really matters. We could probably add a validation check for algos like PPO that warns if that’s the case.
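A minimal sketch of the splitting and of the kind of validation check mentioned above. The helper names are hypothetical (not RLlib functions), and the sketch follows only the simple description in this answer: consecutive minibatches of sgd_minibatch_size timesteps with a smaller last one if the sizes don't divide evenly, each cut into max_seq_len chunks. It does not settle whether RLlib actually uses or drops that smaller last minibatch.

```python
import warnings


def minibatch_sizes(train_batch_size, sgd_minibatch_size):
    """Sizes of consecutive minibatches; the last one may be smaller."""
    return [
        min(sgd_minibatch_size, train_batch_size - start)
        for start in range(0, train_batch_size, sgd_minibatch_size)
    ]


def seq_chunk_lengths(num_steps, max_seq_len):
    """Lengths of the max_seq_len time chunks a minibatch is cut into."""
    return [min(max_seq_len, num_steps - start) for start in range(0, num_steps, max_seq_len)]


def check_minibatch_divisibility(train_batch_size, sgd_minibatch_size):
    """Hypothetical validation check: warn if the last minibatch would be smaller."""
    remainder = train_batch_size % sgd_minibatch_size
    if remainder:
        warnings.warn(
            f"train_batch_size={train_batch_size} is not a multiple of "
            f"sgd_minibatch_size={sgd_minibatch_size}; the last minibatch "
            f"has only {remainder} timesteps."
        )
    return remainder == 0


print(minibatch_sizes(1000, 100))        # ten minibatches of 100
print(minibatch_sizes(1000, 128))        # [128, 128, 128, 128, 128, 128, 128, 104]
print(seq_chunk_lengths(100, 20))        # [20, 20, 20, 20, 20]
check_minibatch_divisibility(1000, 128)  # emits a UserWarning, returns False
```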


@sven1977 Again, thanks for your explanations!

Does this mean that a smaller last minibatch, if there is one, is ignored and not used?
If so, it would always be advisable to make train_batch_size a multiple of sgd_minibatch_size.