[RLlib] Curiosity Exploration Clarification


I am using the Curiosity exploration module in conjunction with PPO for my external env, and I was trying to get a better handle on how it interacts with the RLlib default model (with `use_lstm: True`).

I see that it has:

 "feature_dim": 288,  # Dimensionality of the generated feature vectors.
    "inverse_net_hiddens": [256],  # Hidden layers of the "inverse" model.
    "inverse_net_activation": "relu",  # Activation of the "inverse" model.
    "forward_net_hiddens": [256],  # Hidden layers of the "forward" model.
    "forward_net_activation": "relu",  # Activation of the "forward" model.

My understanding is thus:

  1. Take the current observation and run it through a model (one that is added by the exploration module and trained inside of it), which converts the entire observation down to a 288-dimensional feature vector. Is this one-hot encoded? Or 288 different values?
  2. Then, given the feature vector from 1) and the action, the forward net produces a different 288-dimensional feature vector as its prediction of the next observation's features.
  3. I'm a little lost here, but here goes: the final net, the inverse net, tries to predict the action taken between the current observation and the next observation. Where does the next observation come from? Is it the output of the forward net (from step 2)? Or does it get fed in later, once we actually have the next observation? The documentation also states it is only used to train the "feature" net. So the inverse net is trained to guess which action was taken to get from state A to the next state B, and its result is used to train 1)?
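To make my understanding of the data flow concrete, here is a toy, pure-Python sketch of the three steps above (no learning; the three functions are hypothetical stand-ins for the learned networks, and the tiny `FEATURE_DIM` is just for readability):

```python
FEATURE_DIM = 4  # RLlib's default is 288; small here for readability

def feature_net(obs):
    # "features": compress an observation into a FEATURE_DIM float vector
    return [sum(obs) * (i + 1) / 10.0 for i in range(FEATURE_DIM)]

def forward_net(phi, action):
    # "forward": predict the NEXT feature vector from current features + action
    return [p + 0.1 * action for p in phi]

def inverse_net(phi, phi_next):
    # "inverse": predict which action led from phi to phi_next (here: 0 or 1)
    diff = sum(n - p for p, n in zip(phi, phi_next))
    return 0 if diff < 0.5 else 1

obs, next_obs, action = [0.2, 0.3], [0.25, 0.35], 1

phi = feature_net(obs)                # step 1
phi_next = feature_net(next_obs)      # computed from the REAL next observation
phi_pred = forward_net(phi, action)   # step 2: prediction, not a replacement
                                      # for the real next observation

# intrinsic reward: squared distance between predicted and actual next features
intrinsic_reward = sum((a - b) ** 2 for a, b in zip(phi_pred, phi_next))

predicted_action = inverse_net(phi, phi_next)  # step 3: trained to match `action`
```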

I would greatly appreciate any help wrapping my mind around the "inverse net", and the questions in 1), as well as any other misunderstandings I may have!


Hey @Denys_Ashikhin , great question and sorry for the delay, which was caused by the question being “uncategorized”. It helps if you set a category (e.g. “RLlib”) when you post a new question. That way, we’ll find it more easily and can assign the right person to answer it.

The best way to understand what RLlib is doing here is to look at figure 2 in this paper here:

In short, RLlib simply adds the additional 3 networks (“features”, “inverse”, and “forward”) to your model (not caring about LSTM wrapping; the curiosity models do not need an LSTM).

- "features" learns how to convert observations to feature vectors of size `feature_dim`.
- "inverse" learns how to predict the action that was taken between "feature" and "next-feature" (two feature vectors computed from obs(t) and obs(t+1)).
- "forward" learns how to predict the next feature vector, given the feature vector of the current observation and the actual action taken.
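One detail worth spelling out: the "features" net has no loss of its own. As in the paper, gradients from both the inverse-model loss and the forward-model loss flow back into it, weighted by a `beta` coefficient. A minimal sketch of that combination (names are illustrative):

```python
def curiosity_loss(forward_loss, inverse_loss, beta=0.2):
    # beta weighs the forward-model loss against the inverse-model
    # (action-prediction) loss; gradients from BOTH terms flow back
    # into the shared "features" net.
    return beta * forward_loss + (1.0 - beta) * inverse_loss
```

So with `beta=0.2`, a forward loss of 1.0 and an inverse loss of 0.0 give a total curiosity loss of 0.2.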

Hope this helps. :slight_smile:


Hi Sven,

I have a (hopefully) final follow-up question. Is the feature vector discrete? That is, if `feature_dim` is 256, does that mean there are 256 values that can each only be 0 or 1? Or can each of the 256 values be any (floating-point) number?
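To illustrate what I mean: the feature net is an ordinary neural network, so (as I understand it) its output would be `feature_dim` unconstrained floats rather than a one-hot or binary code. A hypothetical linear "feature net" would look like this:

```python
import random

random.seed(0)
FEATURE_DIM = 256
OBS_DIM = 8  # toy observation size, purely illustrative

# random weight matrix standing in for a trained feature net
weights = [
    [random.gauss(0, 1) for _ in range(OBS_DIM)]
    for _ in range(FEATURE_DIM)
]

def feature_net(obs):
    # each output entry is a real-valued dot product, not a 0/1 indicator
    return [sum(w * o for w, o in zip(row, obs)) for row in weights]

phi = feature_net([0.5] * OBS_DIM)
# phi has 256 entries, each an arbitrary float (negative, > 1, etc.)
```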