Using an LSTM model for multi-agent policy gradients with different recurrent hidden states per agent

I have a custom multi-agent environment with multiple groups of agents, where I would like all agents in a group to use the same policy. I configured this by setting AlgorithmConfig.multi_agent(policies={...}, policy_mapping_fn=...). Having the policy remember a history of past states and actions would be very useful, since the optimal action for a given state in my environment depends on prior states and actions. Naturally, this leads to using a recurrent layer (use_lstm in the model config dictionary).
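A minimal sketch of that setup (PPO here; the env name, group prefixes, and policy IDs are placeholders, not from a real project):

```python
from ray.rllib.algorithms.ppo import PPOConfig

def policy_mapping_fn(agent_id, *args, **kwargs):
    # Agents whose ID starts with "group_a_" share one policy, the rest share the other.
    return "policy_a" if agent_id.startswith("group_a_") else "policy_b"

config = (
    PPOConfig()
    .environment(env="my_custom_multi_agent_env")   # placeholder env name
    .multi_agent(
        policies={"policy_a", "policy_b"},           # one shared policy per group
        policy_mapping_fn=policy_mapping_fn,
    )
    .training(
        model={
            "use_lstm": True,        # wrap the default model with an LSTM layer
            "lstm_cell_size": 64,
            "max_seq_len": 20,
        }
    )
)
```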

However, each agent within a group observes a different environment state and takes its own actions. Although I would like parameter sharing across the agents in a group, I don't want every agent to share the same LSTM hidden/cell states, since each agent acts independently. How can I do this?

I am looking through the rllib source code but it is very convoluted :sweat_smile:
My goal, basically, is for the actor/policy network and the critic/value network to keep separate LSTM hidden/cell states for each agent while still sharing parameters during training.
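For concreteness, here is a sketch of the behavior I want (the desired semantics, not RLlib internals). It continues the placeholder config above and feeds dummy observations instead of stepping a real env: every agent in a group queries the same policy object (shared weights), but each one carries its own recurrent state between calls.

```python
algo = config.build()                         # assumes the placeholder env is registered
policy = algo.get_policy("policy_a")          # one policy object per group -> shared weights

# One independent recurrent (hidden/cell) state per agent.
agent_states = {
    agent_id: policy.get_initial_state()        # fresh LSTM state for each agent
    for agent_id in ["group_a_0", "group_a_1"]  # hypothetical agent IDs
}

# Dummy observations just to show the state handling; a real loop would step the env.
observations = {aid: policy.observation_space.sample() for aid in agent_states}

for agent_id, obs in observations.items():
    action, state_out, _ = algo.compute_single_action(
        obs, state=agent_states[agent_id], policy_id="policy_a"
    )
    agent_states[agent_id] = state_out        # carry this agent's state forward separately
```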