How could I implement gradient accumulation?

1. Severity of the issue: (select one)
High: Completely blocks me.

For my own research, as well as for recreation, I would like to work with gradient accumulation. From my research, it does not appear that Ray/RLlib supports this in the respective framework Learners.

Among other things, I plan to override compute_gradients, prevent optim.zero_grad() during my accumulation steps, and likely return an empty dict in those steps. However, I am not sure whether that is sufficient and what side effects might appear.
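
To make the idea concrete, here is a minimal sketch of what I have in mind. This is only an assumption-laden draft, not a verified implementation: the class name, the `accumulation_steps` attribute, and the use of the internal `self._params` dict are my own, and summing `loss_per_module.values()` mirrors what I believe the default `TorchLearner.compute_gradients` does in recent Ray versions.

```python
from ray.rllib.core.learner.torch.torch_learner import TorchLearner


class GradientAccumulationLearner(TorchLearner):
    """Sketch: accumulate gradients over N calls before handing them on."""

    accumulation_steps = 4  # placeholder; should come from the config

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._accumulation_count = 0

    def compute_gradients(self, loss_per_module, **kwargs):
        if self._accumulation_count == 0:
            # Start of a new accumulation cycle: clear the old .grad buffers.
            # (The default implementation zeroes them on every call instead.)
            for param in self._params.values():
                param.grad = None

        # Backward pass WITHOUT zeroing, so gradients keep accumulating.
        # Optionally divide by accumulation_steps to average instead of sum.
        total_loss = sum(loss_per_module.values())
        total_loss.backward()
        self._accumulation_count += 1

        if self._accumulation_count < self.accumulation_steps:
            # Not an update step: return an empty dict so that the following
            # postprocess/apply_gradients calls have nothing to apply.
            return {}

        # Update step: expose the accumulated gradients to apply_gradients().
        self._accumulation_count = 0
        return {pid: p.grad for pid, p in self._params.items()}
```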

I’ll update this post with my findings. In the meantime, I would appreciate any further input that could help me reach an implementation supporting gradient accumulation and avoid pitfalls within RLlib’s framework.


I was able to write and test a small Learner in my project: GitHub - Daraan/ray_utilities: ray & RLlib tools for unified code across different repositories. Experiments with dynamic hyperparameters. The Learner accumulates gradients over multiple steps and applies the update only after the final accumulation step.

The file and class are standalone and usable in any RLlib + torch algorithm.

Hi @Daraan, thanks for the question and for sharing your code here. You did everything as intended: create your own Learner and then override compute_gradients. Great work!

I just skimmed over it and one thing looked suspicious to me: it appears that zero_grad is called in the last accumulation step, i.e. the step in which the update happens. As a result, only the last gradient is applied while the history of accumulated gradients is discarded.

Let me know whether this is the case or not.
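
In plain PyTorch terms (just an illustrative toy example, all names here are made up), the difference is only where the zeroing happens:

```python
import torch

# Toy example: accumulate gradients over several mini-batches.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accumulation_steps = 4

for step in range(1, 2 * accumulation_steps + 1):
    x, y = torch.randn(8, 4), torch.randn(8, 1)
    loss = torch.nn.functional.mse_loss(model(x), y) / accumulation_steps
    loss.backward()              # gradients pile up in the .grad buffers
    if step % accumulation_steps == 0:
        optimizer.step()         # applies the sum of the last N gradients
        optimizer.zero_grad()    # zero AFTER the update (or at the start of
                                 # the next cycle), never right before it
```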


Thank you so much for your review. You are totally correct here; it is tricky to test this correctly.
I’ve updated it.

Thank you!

Great that I could help! Very nice example!

Keep up the good work!