I am trying to train a multi-agent reinforcement learning model in which, at each timestep, I save a tuple of rewards, one per agent. I want the model to take the observation vector as input and output a matrix of dimension N x F’, where N is the number of agents and F’ is the size of the hidden state for each agent.
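To make the shapes concrete, here is a minimal NumPy sketch of the setup I have in mind. The sizes, the `encode` function, and the per-agent linear map followed by `tanh` are placeholders I made up for illustration; they are not my actual model.

```python
import numpy as np

N_AGENTS = 3      # N: number of agents (illustrative value)
OBS_DIM = 8       # length of the observation vector (illustrative value)
HIDDEN_DIM = 4    # F': hidden-state size per agent (illustrative value)

rng = np.random.default_rng(0)

# Hypothetical stand-in for the model: one linear map per agent, then tanh.
W = rng.normal(size=(N_AGENTS, OBS_DIM, HIDDEN_DIM))

def encode(obs):
    """Map an observation vector of shape (OBS_DIM,) to an (N, F') matrix."""
    # einsum contracts the observation dimension separately for each agent.
    return np.tanh(np.einsum("o,nof->nf", obs, W))

# One timestep: a single observation and one reward per agent.
obs = rng.normal(size=OBS_DIM)
hidden = encode(obs)                 # shape (N_AGENTS, HIDDEN_DIM)
rewards_t = (1.0, 0.0, -0.5)         # tuple of rewards, one entry per agent

# Rollout storage: one reward tuple per timestep, stacked into (T, N).
reward_buffer = [rewards_t, (0.5, 0.5, 0.0)]
rewards = np.asarray(reward_buffer)  # shape (T, N_AGENTS)
```

So a rollout of T timesteps gives me a reward array of shape (T, N), with one column per agent.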

When I collect a batch, I want to compute a batched loss for each agent.
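As a rough illustration of what I mean by a per-agent batched loss, the sketch below reduces over the batch axis only, so each agent keeps its own loss value. The function name `per_agent_loss`, the squared-error term, and the optional `agent_weights` are assumptions made up for this example; the real objective would be whatever RL loss I end up using.

```python
import numpy as np

def per_agent_loss(pred, target, agent_weights=None):
    """Compute one loss per agent from arrays of shape (B, N).

    Returns a tuple (per_agent, total): per_agent has shape (N,),
    total is a weighted sum of the per-agent losses.
    """
    sq_err = (pred - target) ** 2       # element-wise error, shape (B, N)
    per_agent = sq_err.mean(axis=0)     # average over the batch only -> (N,)
    if agent_weights is None:
        # Assumption: weight all agents equally unless told otherwise.
        n_agents = per_agent.shape[0]
        agent_weights = np.full(n_agents, 1.0 / n_agents)
    total = float(np.dot(agent_weights, per_agent))
    return per_agent, total

# Batch of B=2 samples for N=2 agents.
pred = np.array([[1.0, 2.0], [3.0, 4.0]])
target = np.array([[0.0, 2.0], [1.0, 1.0]])
per_agent, total = per_agent_loss(pred, target)
```

The point is that the reduction axis and the way the agents are combined are both explicit choices here, and those are exactly the parts I would like to control.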

Is there a way for me to specify how the batched losses are computed?