- High: It blocks me from completing my task.
My environment is partially observable. I have access to its ground truth state at training time, but I do not have it at test time. Therefore, I want to train a model that predicts the environment state given a stack of the last n observations.
My RL model would therefore consist of three components: an encoder, a decoder, and a Q-values head.
At training time, the encoder takes the last n observations and encodes them into a k-dimensional vector z. The decoder reconstructs the ground truth state and is trained with an MSE loss between the actual ground truth state and the predicted one. The Q-values head takes the vector z as input and is trained with the usual RLlib losses.
At test time, we throw away the decoder and use only the encoder and Q-values head to select actions.
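To make the setup concrete, here is a minimal toy sketch of what I have in mind. It uses plain NumPy with linear maps standing in for the real networks, and all shapes, names, and the loss weight are placeholders, not the actual RLlib API:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder dimensions: n stacked observations, latent size k, etc.
n_obs, obs_dim, k, state_dim, n_actions = 4, 8, 16, 6, 3

# Toy linear "networks" standing in for the real NN components.
W_enc = rng.normal(size=(n_obs * obs_dim, k))  # encoder
W_dec = rng.normal(size=(k, state_dim))        # decoder (training only)
W_q = rng.normal(size=(k, n_actions))          # Q-values head

def encode(obs_stack):
    # Flatten the stack of the last n observations into one vector,
    # then project it to the k-dimensional latent z.
    return obs_stack.reshape(-1) @ W_enc

obs_stack = rng.normal(size=(n_obs, obs_dim))  # last n observations
true_state = rng.normal(size=(state_dim,))     # ground truth (train time only)

z = encode(obs_stack)
state_pred = z @ W_dec   # decoder reconstruction of the ground truth state
q_values = z @ W_q       # Q-values from the shared latent

# Auxiliary reconstruction loss; at training time this would be added
# (with some weight beta) to whatever RL loss RLlib computes for the Q head:
# total_loss = rl_loss + beta * mse_loss
mse_loss = np.mean((state_pred - true_state) ** 2)

# At test time the decoder is dropped; actions come from q_values only.
action = int(np.argmax(q_values))
```

The key point is that the encoder is shared: gradients from both the MSE reconstruction loss and the RL loss should flow into it, while the decoder gets gradients only from the MSE term.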
The n past observations are already included, along with the current one, in the observation the agent receives, so that part is not an issue.
How can I implement this custom encoder with its corresponding MSE loss, and train it jointly with the Q-values head?