I am using the Curiosity exploration module in conjunction with PPO for my external env, and I am trying to get a better handle on how it interacts with the RLlib default model (with LSTM: True).
I see that it has:

    "feature_dim": 288,  # Dimensionality of the generated feature vectors.
    "inverse_net_hiddens": ,  # Hidden layers of the "inverse" model.
    "inverse_net_activation": "relu",  # Activation of the "inverse" model.
    "forward_net_hiddens": ,  # Hidden layers of the "forward" model.
    "forward_net_activation": "relu",  # Activation of the "forward" model.

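For context, here is a rough sketch of where these keys would sit in a PPO config dict. The nesting under `exploration_config` with `"type": "Curiosity"` follows my reading of the RLlib exploration docs; `use_lstm` is assumed to be the model key behind "LSTM: True", and the `[256]` hidden sizes are placeholder values chosen only for illustration, since the listing above does not show them.

```python
# Sketch of a PPO config using the Curiosity exploration module.
# Assumptions: key nesting as in the RLlib exploration docs; the hidden-layer
# sizes are illustrative placeholders, not values taken from the post.
config = {
    "model": {
        "use_lstm": True,  # assumed spelling of "LSTM: True"
    },
    "exploration_config": {
        "type": "Curiosity",
        "feature_dim": 288,                # size of the generated feature vectors
        "inverse_net_hiddens": [256],      # placeholder value
        "inverse_net_activation": "relu",
        "forward_net_hiddens": [256],      # placeholder value
        "forward_net_activation": "relu",
    },
}

# The policy's own model (with the LSTM) is configured under "model";
# the curiosity nets are configured separately under "exploration_config".
curiosity_keys = sorted(config["exploration_config"].keys())
```

The point of the sketch is only that the feature, inverse, and forward nets are configured apart from the policy model, which is why their interaction with the LSTM is not obvious from the config alone.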
My understanding is thus:
1. Take the current observation and run it through a model (one that is added by the exploration module and trained inside it), which converts the entire observation down to a 288-dimensional feature vector. Is this one-hot encoded, or is it 288 different values?
2. Then, given the observation used in 1) and the action, the forward net produces a different 288-dimensional feature vector as its prediction for the next observation.
3. A little lost here, but here goes: the final net, the inverse net, tries to predict the action taken between the current observation and the next observation. Where does the next observation come from? Is it from the forward net (from step 2)? Or does it get fed in later, once we actually have the next observation?

The documentation also states that the inverse net is "only used to train the “feature” net". So this inverse net is trained to guess which action was taken to get from state A to the next state B, and its result is used to train 1)?
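
The three pieces above can be sketched in NumPy as follows. This is a minimal illustration of a generic intrinsic curiosity module, not RLlib's actual code: the layer sizes, the random weights, and all function names are hypothetical. It assumes that the feature vector is a dense real-valued vector (not one-hot) and that the next observation comes from the actually sampled transition rather than from the forward net.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, FEATURE_DIM, NUM_ACTIONS, HIDDEN = 10, 288, 4, 256  # hypothetical sizes


def dense(in_dim, out_dim):
    """Random weights and zero bias for one fully connected layer."""
    return rng.normal(0.0, 0.1, size=(in_dim, out_dim)), np.zeros(out_dim)


def relu(x):
    return np.maximum(x, 0.0)


# 1) Feature net: observation -> dense feature vector of size FEATURE_DIM.
W_feat, b_feat = dense(OBS_DIM, FEATURE_DIM)

def features(obs):
    return relu(obs @ W_feat + b_feat)


# 2) Forward net: (phi(obs), one-hot action) -> predicted phi(next_obs).
W_f1, b_f1 = dense(FEATURE_DIM + NUM_ACTIONS, HIDDEN)
W_f2, b_f2 = dense(HIDDEN, FEATURE_DIM)

def forward_net(phi, action_onehot):
    h = relu(np.concatenate([phi, action_onehot], axis=-1) @ W_f1 + b_f1)
    return h @ W_f2 + b_f2


# 3) Inverse net: (phi(obs), phi(next_obs)) -> logits over actions.
W_i1, b_i1 = dense(2 * FEATURE_DIM, HIDDEN)
W_i2, b_i2 = dense(HIDDEN, NUM_ACTIONS)

def inverse_net(phi, phi_next):
    h = relu(np.concatenate([phi, phi_next], axis=-1) @ W_i1 + b_i1)
    return h @ W_i2 + b_i2


# One sampled transition; next_obs is assumed to be the real next observation.
obs = rng.normal(size=(1, OBS_DIM))
next_obs = rng.normal(size=(1, OBS_DIM))
action = np.array([2])
action_onehot = np.eye(NUM_ACTIONS)[action]

phi, phi_next = features(obs), features(next_obs)
phi_next_pred = forward_net(phi, action_onehot)

# Forward loss: squared error in feature space, also used as the curiosity bonus.
forward_loss = 0.5 * np.sum((phi_next_pred - phi_next) ** 2, axis=-1)
intrinsic_reward = forward_loss

# Inverse loss: cross-entropy between predicted and actually taken action.
logits = inverse_net(phi, phi_next)
shifted = logits - logits.max(axis=-1, keepdims=True)
log_probs = shifted - np.log(np.sum(np.exp(shifted), axis=-1, keepdims=True))
inverse_loss = -log_probs[np.arange(len(action)), action]
```

Under these assumptions, the inverse loss is the signal that would flow back into `features()` during training, which would match the documentation's remark that the inverse net only serves to train the feature net; the forward-net error is what shows up as extra reward.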
I would greatly appreciate any help wrapping my mind around the “inverse net” and the questions in 1), as well as any other misunderstandings I may have!