Hi there
Currently I’m researching a cooperative MARL environment using torch. For this I want to use the action of one model call (or a part of it) as input to a following model call; that part could be implemented via the Trajectory View API. The fed-back part of the action does not influence the reward. Because the agents cooperate, training could benefit from back-propagating the gradient with torch.autograd across more than one model call (action = observation). Vanishing gradients would not bother me, but I would clip gradients to avoid exploding ones. Unrolling like this effectively makes the model rather deep, but for training I don’t mind limiting the horizon. If I’m not mistaken, tensors are by default converted to numpy arrays between calls, which would break the autograd graph. Can I avoid this conversion?
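For the action-as-input part, I imagine something along these lines with the Trajectory View API (untested sketch; the model class, the flat Box spaces, and the layer sizes are just assumptions for illustration):

```python
import torch
import torch.nn as nn
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.policy.view_requirement import ViewRequirement


class PrevActionModel(TorchModelV2, nn.Module):
    """Feeds the previous action back in as part of the model input."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        TorchModelV2.__init__(self, obs_space, action_space, num_outputs,
                              model_config, name)
        nn.Module.__init__(self)
        # assumes flat Box observation and action spaces
        self.net = nn.Linear(obs_space.shape[0] + action_space.shape[0],
                             num_outputs)
        # ask the Trajectory View API to include the last action in input_dict
        self.view_requirements["prev_actions"] = ViewRequirement(
            data_col="actions", shift=-1, space=action_space)

    def forward(self, input_dict, state, seq_lens):
        x = torch.cat([input_dict["obs"].float(),
                       input_dict["prev_actions"].float()], dim=-1)
        return self.net(x), state
```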
Is something like this possible? Do you have an idea of how one could implement such behavior?
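To make concrete what I mean by back-propagating across several calls, here is a stripped-down sketch in plain torch (all names, sizes, and the dummy loss are made up): the action is fed back in without .detach(), so a single backward() reaches through every call, and gradients are clipped before the optimizer step.

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(8, 32), nn.Tanh(), nn.Linear(32, 4))
optim = torch.optim.Adam(policy.parameters(), lr=1e-3)

env_obs = torch.randn(1, 4)   # environment part of the observation
message = torch.zeros(1, 4)   # action part that is fed back as input

losses = []
for t in range(5):            # limited training horizon
    action = policy(torch.cat([env_obs, message], dim=-1))
    message = action          # no .detach(), so the graph spans all calls
    losses.append(action.pow(2).mean())  # stand-in for the real objective

optim.zero_grad()
torch.stack(losses).sum().backward()  # gradient flows through every call
nn.utils.clip_grad_norm_(policy.parameters(), max_norm=1.0)
optim.step()
```

This is basically backpropagation through time over the action channel, just with the recurrence going through the environment instead of a hidden state.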
Thank you in advance; partial answers are appreciated as well. I’m happy to have a discussion.
Edit: Added more details