Hi @nathanlct,
The value function looks different when using the LSTM wrapper than when not.
When you use the LSTM wrapper, there is only one set of layers going into the LSTM, and the value layers use the output coming from the LSTM as their input. So instead of two heads going into the LSTM, you have two heads coming out of the LSTM. You can see images I created of the two cases in this post: What is the intended architecture of PPO vf_share_layers=False when using an LSTM.
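To make that layout concrete, here is a rough, self-contained PyTorch schematic of what the wrapper ends up doing (this is my own sketch for illustration, not RLlib's actual wrapper code):

```python
import torch
import torch.nn as nn


class WrapperStyleLSTM(nn.Module):
    """Schematic of the use_lstm wrapper: one shared trunk feeds the LSTM,
    and both the policy head and the value head branch off the LSTM output."""

    def __init__(self, obs_size=4, hidden=64, num_actions=2):
        super().__init__()
        self.trunk = nn.Linear(obs_size, hidden)        # single set of layers into the LSTM
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.policy_head = nn.Linear(hidden, num_actions)
        self.value_head = nn.Linear(hidden, 1)          # value also reads the LSTM output

    def forward(self, obs, state=None):
        x = torch.relu(self.trunk(obs))                 # obs: [B, T, obs_size]
        out, state = self.lstm(x, state)
        return self.policy_head(out), self.value_head(out), state
```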
I think this means you are going to have to add your own LSTM. Luckily, this is pretty straightforward with RLlib. You do have to think about how and where you want to do this separation, though. Are you going to add a layer before the LSTM and then feed the LSTM a concatenation of a policy and a value embedding layer? Are you going to treat them as pass-through inputs and feed them in after the LSTM? Are you going to have two LSTMs? Will only your policy use an LSTM, with the value function being just FC layers? I have sketched that last option below.
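Here is one way the last option could look, adapted from RLlib's custom RNN model example. Treat it as a hedged sketch rather than a drop-in solution: the class name `SeparateValueLSTM` and the layer sizes are mine, and I'm assuming a flat Box observation space.

```python
# Sketch: policy branch uses an LSTM, value branch is a separate FC stack on the
# raw observations (no weight sharing). Adapted from RLlib's custom RNN example.
import numpy as np
import torch
import torch.nn as nn

from ray.rllib.models import ModelCatalog
from ray.rllib.models.modelv2 import ModelV2
from ray.rllib.models.torch.recurrent_net import RecurrentNetwork
from ray.rllib.utils.annotations import override


class SeparateValueLSTM(RecurrentNetwork, nn.Module):
    def __init__(self, obs_space, action_space, num_outputs, model_config, name,
                 fc_size=64, lstm_state_size=64):
        nn.Module.__init__(self)
        super().__init__(obs_space, action_space, num_outputs, model_config, name)

        self.obs_size = int(np.product(obs_space.shape))  # assumes a flat Box obs space
        self.lstm_state_size = lstm_state_size

        # Policy branch: fc -> LSTM -> action logits.
        self.policy_fc = nn.Linear(self.obs_size, fc_size)
        self.lstm = nn.LSTM(fc_size, lstm_state_size, batch_first=True)
        self.action_branch = nn.Linear(lstm_state_size, num_outputs)

        # Value branch: completely separate FC layers, no recurrence.
        self.value_branch = nn.Sequential(
            nn.Linear(self.obs_size, fc_size), nn.ReLU(), nn.Linear(fc_size, 1))
        self._value_out = None

    @override(ModelV2)
    def get_initial_state(self):
        # One (h, c) pair for the policy LSTM only.
        return [
            self.policy_fc.weight.new(1, self.lstm_state_size).zero_().squeeze(0),
            self.policy_fc.weight.new(1, self.lstm_state_size).zero_().squeeze(0),
        ]

    @override(RecurrentNetwork)
    def forward_rnn(self, inputs, state, seq_lens):
        # inputs is [B, T, obs_size]; the base class adds the time dimension.
        x = nn.functional.relu(self.policy_fc(inputs))
        lstm_out, [h, c] = self.lstm(
            x, [torch.unsqueeze(state[0], 0), torch.unsqueeze(state[1], 0)])
        logits = self.action_branch(lstm_out)

        # The value function never sees the LSTM: plain FC layers on the raw obs.
        self._value_out = self.value_branch(inputs)

        return logits, [torch.squeeze(h, 0), torch.squeeze(c, 0)]

    @override(ModelV2)
    def value_function(self):
        assert self._value_out is not None, "must call forward() first"
        return torch.reshape(self._value_out, [-1])


ModelCatalog.register_custom_model("separate_value_lstm", SeparateValueLSTM)
```

You would then point your trainer at it with something like `"model": {"custom_model": "separate_value_lstm"}` in the config, and leave `use_lstm` off since the recurrence now lives inside your own model. The other options (concatenated embeddings before the LSTM, pass-through inputs after it, or two LSTMs) follow the same pattern, just with the branches wired differently in `forward_rnn`.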