Use a custom encoder before the Q model and train it with a custom loss

  • High: It blocks me from completing my task.

My environment is partially observable. I have access to its ground truth state at training time, but I do not have it at test time. Therefore, I want to train a model that predicts the environment state given a stack of the last n observations.

My RL model would therefore be made of 3 components: an encoder, a decoder, and a Q values head (see the sketch after the list below).

  1. At training time, the encoder takes the last n observations and encodes them into a k-dimensional vector z. The decoder reconstructs the ground truth state and is trained with an MSE loss between the actual ground truth state and the predicted one. The Q values head takes the vector z as input and is trained with the usual RLlib losses.

  2. At test time, we throw away the decoder, and just use the encoder and Q values head to select actions.
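
To make this concrete, here is a rough PyTorch sketch of the architecture I have in mind. All layer sizes and names (e.g. `EncoderDecoderQModel`, `obs_stack_dim`) are made up for illustration, not taken from any RLlib API:

```python
import torch.nn as nn


class EncoderDecoderQModel(nn.Module):
    """Encoder -> latent z; decoder reconstructs the ground truth state
    (training only); Q head maps z to per-action Q values."""

    def __init__(self, obs_stack_dim, state_dim, num_actions, k=64):
        super().__init__()
        # Encoder: flattened stack of the last n observations -> k-dim latent z.
        self.encoder = nn.Sequential(
            nn.Linear(obs_stack_dim, 256), nn.ReLU(), nn.Linear(256, k))
        # Decoder: z -> predicted ground truth state (thrown away at test time).
        self.decoder = nn.Sequential(
            nn.Linear(k, 256), nn.ReLU(), nn.Linear(256, state_dim))
        # Q head: z -> one Q value per discrete action.
        self.q_head = nn.Sequential(
            nn.Linear(k, 256), nn.ReLU(), nn.Linear(256, num_actions))

    def forward(self, obs_stack, ground_truth_state=None):
        z = self.encoder(obs_stack)
        q_values = self.q_head(z)
        if ground_truth_state is None:
            # Test time: no ground truth available, only Q values are used.
            return q_values, None
        # Training time: auxiliary supervised reconstruction loss.
        recon_loss = nn.functional.mse_loss(
            self.decoder(z), ground_truth_state)
        return q_values, recon_loss
```

During training, the total loss would then be something like `q_loss + beta * recon_loss`, with `beta` a weighting coefficient I would tune; at test time only `encoder` and `q_head` are used.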

The n past observations are already included in the observation that the agent receives along with the current one, so that’s not an issue.

How can I implement this custom encoder with its MSE loss, and train it jointly with the Q values head?
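
In case it helps, the closest mechanism I have found so far is the `ModelV2.custom_loss()` hook (as used in `rllib/examples/custom_loss.py`), but I am not sure it is the intended way to do this with DQN. Below is a rough sketch of how I imagined using it; the `"state_dim"` entry in `custom_model_config` and the `"state_gt"` batch key are hypothetical, since I also do not know how to get the ground truth state into the train batch:

```python
import numpy as np
from torch import nn
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.utils.annotations import override


class EncoderDecoderQTorchModel(TorchModelV2, nn.Module):
    """Sketch: expose the encoder output as the model's features and add the
    reconstruction MSE as an auxiliary loss in custom_loss()."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        TorchModelV2.__init__(self, obs_space, action_space, num_outputs,
                              model_config, name)
        nn.Module.__init__(self)
        k = 64  # latent size, arbitrary
        # "state_dim" in custom_model_config is my own (hypothetical) option.
        state_dim = model_config["custom_model_config"]["state_dim"]
        obs_dim = int(np.prod(obs_space.shape))
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                     nn.Linear(256, k))
        self.decoder = nn.Sequential(nn.Linear(k, 256), nn.ReLU(),
                                     nn.Linear(256, state_dim))
        self.q_head = nn.Linear(k, num_outputs)
        self._last_z = None

    @override(TorchModelV2)
    def forward(self, input_dict, state, seq_lens):
        z = self.encoder(input_dict["obs_flat"].float())
        self._last_z = z
        return self.q_head(z), state

    @override(TorchModelV2)
    def custom_loss(self, policy_loss, loss_inputs):
        # "state_gt" is a hypothetical batch key: the ground truth state would
        # somehow need to end up in the train batch, which is part of my question.
        state_gt = loss_inputs["state_gt"].float()
        recon_loss = nn.functional.mse_loss(self.decoder(self._last_z), state_gt)
        # Add the auxiliary MSE to every policy loss term.
        return [loss + recon_loss for loss in policy_loss]
```

Am I on the right track with `custom_loss()`, or is there a better supported way to do this?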