How severely does this issue affect your experience of using Ray?
- Medium: It contributes significant difficulty to completing my task, but I can work around it.
Hi, and thanks again for this amazing project!
I'm not sure whether this is a bug or whether I'm misunderstanding the docs. When running PPO, I'm seeing the inner training loop (where the loss is computed) being called an unexpected number of times.
Expected behavior
The expected behavior, which I obtain when I set `num_sgd_iter=1`, is that the loss will be computed `train_batch_size / sgd_minibatch_size * num_sgd_iter = 6` times with my hyper-parameters:
config["train_batch_size"] = 60_000
config["sgd_minibatch_size"] = 10_000
config["num_sgd_iter"] = 1
Unexpected behavior
However, if I set a different number of SGD iterations, for example:

```python
config["train_batch_size"] = 60_000
config["sgd_minibatch_size"] = 10_000
config["num_sgd_iter"] = 10
```
I observe that the inner loop is called 50 times instead of the expected 60. The same pattern shows up with `num_sgd_iter=30`: I get 150 calls to the loss function instead of 180.
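To make the discrepancy concrete, here is the comparison I am making (the observed counts come from my own counter described below, not from any RLlib API):

```python
train_batch_size = 60_000
sgd_minibatch_size = 10_000

# num_sgd_iter -> number of loss calls observed via my counter
observed = {10: 50, 30: 150}
for num_sgd_iter, got in observed.items():
    expected = (train_batch_size // sgd_minibatch_size) * num_sgd_iter
    print(f"num_sgd_iter={num_sgd_iter}: observed {got}, expected {expected}")
# num_sgd_iter=10: observed 50, expected 60
# num_sgd_iter=30: observed 150, expected 180
```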
A simple way to check this, which is what I'm doing, is to increment a counter in the `loss` method of `PPOTorchPolicy`.
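In case it helps reproduce the count, this is roughly how I instrument it (a minimal sketch, assuming a Ray version where `PPOTorchPolicy` is a class exposing a `loss(self, model, dist_class, train_batch)` method; in older releases the loss is a standalone function and the counter would go there instead, and the import path may differ between Ray 1.x and 2.x):

```python
from ray.rllib.agents.ppo.ppo_torch_policy import PPOTorchPolicy


class CountingPPOTorchPolicy(PPOTorchPolicy):
    """PPOTorchPolicy that counts how many times its loss is evaluated."""

    loss_calls = 0  # class-level counter, incremented on every loss evaluation

    def loss(self, model, dist_class, train_batch):
        type(self).loss_calls += 1
        return super().loss(model, dist_class, train_batch)
```

Reading the counter after a single training iteration gives the counts reported above.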
Please let me know if my understanding of the docs is incorrect or if something is indeed not working as expected.
Thanks in advance!