Confusing behavior in PPO training loop (train_batch_size, sgd_minibatch_size, num_sgd_iter)

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes significant difficulty to completing my task, but I can work around it.

Hi, again thanks for this amazing project!

I don’t know if this is a bug or if I’m misunderstanding the docs. When running PPO, I’m seeing the inner training loop (where the loss is computed) called an unexpected number of times.

Expected behavior
The expected behavior, which I do obtain when I set num_sgd_iter=1, is that the loss will be computed train_batch_size / sgd_minibatch_size * num_sgd_iter = 6 times with my hyper-parameters:

  config["train_batch_size"] = 60_000
  config["sgd_minibatch_size"] = 10_000
  config["num_sgd_iter"] = 1

Unexpected behavior
However, if I set a different number of sgd iterations, for example:

  config["train_batch_size"] = 60_000
  config["sgd_minibatch_size"] = 10_000
  config["num_sgd_iter"] = 10

I observe that the inner loop is called 50 times instead of 60. The same pattern shows up with num_sgd_iter=30: I get 150 calls to the loss function instead of 180.
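For reference, the call count I expect from the docs' formula is just this arithmetic (plain Python, no RLlib involved):

```python
def expected_loss_calls(train_batch_size, sgd_minibatch_size, num_sgd_iter):
    """Loss calls per training iteration implied by the docs:
    one call per SGD minibatch, repeated num_sgd_iter times over the batch."""
    return (train_batch_size // sgd_minibatch_size) * num_sgd_iter

# With the configs above:
print(expected_loss_calls(60_000, 10_000, 1))   # 6   (matches what I observe)
print(expected_loss_calls(60_000, 10_000, 10))  # 60  (but I count 50 calls)
print(expected_loss_calls(60_000, 10_000, 30))  # 180 (but I count 150 calls)
```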

A simple way to check this, which is what I’m doing, is to increment a counter in the loss method of PPOTorchPolicy.
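The counting itself can be sketched generically like this (a stand-in `loss` function here instead of the real PPOTorchPolicy.loss, just to show the wrapping idea):

```python
import functools

def count_calls(fn):
    """Wrap a function and count how many times it is invoked."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return fn(*args, **kwargs)
    wrapper.calls = 0
    return wrapper

# Stand-in for the policy's loss method; in practice one would wrap or
# patch PPOTorchPolicy's loss the same way and read the counter afterwards.
@count_calls
def loss(batch):
    return sum(batch) / len(batch)

for _ in range(6):
    loss([1.0, 2.0, 3.0])
print(loss.calls)  # 6
```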

Please let me know if my understanding of the docs is incorrect or if something is indeed not working as expected.

Thanks in advance!


Interesting finding, thanks for sharing.
Which version of Ray are you using?
Do you get the same numbers if you set simple_optimizer=True in the config?