Backpropagate from observation to action of previous model call

Hi there

Currently I’m researching a MARL environment with cooperating agents using torch. For this I want to use the action from one model call as (part of) the input to a following model call. This could be implemented with the Trajectory View API. This part of the action does not influence the reward. Because the agents cooperate, it could be beneficial to backpropagate the gradient with torch.autograd over more than one model call (action = observation). I would not mind vanishing gradients, but I would clip them to avoid exploding ones. This could result in rather deep models, but for training I don’t mind limiting the horizon. If I’m not mistaken, action tensors are by default converted to numpy arrays. Can I avoid this conversion?

Is something like this possible? Do you have an idea of how one could implement such behavior?
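To make this more concrete, here is a toy, self-contained sketch of what I mean, outside of RLlib (the policy, the dimensions, and the dummy loss are just placeholders):

```python
import torch
import torch.nn as nn

# Toy, self-contained sketch (not RLlib code): the action tensor of one model
# call is concatenated into the next call's input, so torch.autograd can
# backpropagate through several calls; gradients are clipped afterwards.
obs_dim, act_dim, horizon = 4, 4, 5
policy = nn.Sequential(nn.Linear(obs_dim + act_dim, 32), nn.Tanh(),
                       nn.Linear(32, act_dim))

prev_action = torch.zeros(1, act_dim)  # stays a torch tensor, never converted to numpy
losses = []
for t in range(horizon):
    env_obs = torch.randn(1, obs_dim)                      # stand-in for the real observation
    action = policy(torch.cat([env_obs, prev_action], dim=-1))
    losses.append(action.pow(2).mean())                    # dummy loss in place of the real objective
    prev_action = action                                   # no .detach(): the graph spans all calls

torch.stack(losses).sum().backward()                       # gradient flows back through every call
torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=1.0)  # guard against exploding gradients
```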

Thanks in advance for any answers; partial answers are appreciated as well. I’m happy to have a discussion.

Edit: Added more details

Hey @Sertingolix , nice question. This is a tough one. Yes, by default we convert actions directly into numpy arrays (so the gradients are gone). But maybe you could make your model always store the action-computation tensors of the n previous calls. You would probably want to detach these some timesteps back, so that you always backprop the same number of timesteps. Then you could use these stored tensors for the next action computation.
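Something along these lines (all names made up, not RLlib’s model API; it detaches the fed-back action every n steps, which is truncated-BPTT style rather than an exact sliding window, but the idea is the same):

```python
import torch
import torch.nn as nn

# Rough sketch with made-up names (not RLlib's model API). The previous
# action tensor is stored with its graph and fed into the next forward
# pass; it is detached every `n_backprop` calls, so a backward pass never
# reaches further back than that window (truncated-BPTT style).
class ChainedActionModel(nn.Module):
    def __init__(self, obs_dim, act_dim, n_backprop=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + act_dim, 32),
                                 nn.Tanh(), nn.Linear(32, act_dim))
        self.act_dim = act_dim
        self.n_backprop = n_backprop
        self.prev_action = None        # last action, kept as a tensor with its graph
        self.steps_since_detach = 0

    def forward(self, obs):            # assumes a fixed batch size across calls
        if self.prev_action is None:
            prev = torch.zeros(obs.shape[0], self.act_dim)
        elif self.steps_since_detach >= self.n_backprop:
            prev = self.prev_action.detach()   # cut the graph at the window boundary
            self.steps_since_detach = 0
        else:
            prev = self.prev_action            # graph still attached to earlier calls
        action = self.net(torch.cat([obs, prev], dim=-1))
        self.prev_action = action
        self.steps_since_detach += 1
        return action
```

A backward pass on a loss computed from such an action then only reaches back to the most recent detach point, never through the whole episode.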


Thank you for answering. That seems like a good idea; I’ll try this approach and post some progress when I have something. It may take some time, as I’ll try an auxiliary-loss approach first.

You mentioned that one should “backprop always the same number of timesteps”. What is the reasoning behind that? Is it for faster training, because the older gradients have vanished and are of no use, or something I’m not aware of?