Using an LSTM model for multi-agent policy gradient with different recurrent hidden states per agent

It seems likely that this issue remains a challenge: multi-agent setups sharing an LSTM policy under Connector V2, where not all agents act every step, can hit assertion errors or get stuck. The most recent discussions and documentation indicate that RLlib's multi-agent API expects the observation dict to contain only the agents that should act next, and Connector V2 pipelines may not yet fully handle asynchronous agent stepping with shared LSTM state. An agent that skips several steps and then reappears can therefore produce repeated observations and misaligned LSTM state. There is no explicit mention of an official fix or update for this scenario in the latest documentation or issues, and users who encounter it are encouraged to carefully manage agent state and observation emission in their environments to avoid such connector/LSTM mismatches.
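To make that expectation concrete, here is a minimal sketch of a turn-based `MultiAgentEnv` that returns an observation only for the agent due to act next. It assumes a recent RLlib version with the new API stack; the agent ids, spaces, episode length, and reward logic are illustrative assumptions, not taken from the original post.

```python
import gymnasium as gym
import numpy as np
from ray.rllib.env.multi_agent_env import MultiAgentEnv


class TurnBasedEnv(MultiAgentEnv):
    """Sketch: agents alternate turns; only the agent expected to act next
    ever appears in the returned observation dict."""

    def __init__(self, config=None):
        super().__init__()
        self.agents = self.possible_agents = ["agent_0", "agent_1"]
        self.observation_spaces = {
            aid: gym.spaces.Box(-1.0, 1.0, (4,), np.float32) for aid in self.agents
        }
        self.action_spaces = {aid: gym.spaces.Discrete(2) for aid in self.agents}
        self._turn = 0
        self._t = 0

    def reset(self, *, seed=None, options=None):
        self._turn, self._t = 0, 0
        # Emit an observation only for the first agent to act.
        aid = self.agents[self._turn]
        return {aid: self.observation_spaces[aid].sample()}, {}

    def step(self, action_dict):
        self._t += 1
        acting = self.agents[self._turn]
        assert set(action_dict) == {acting}, "Got an action for a non-acting agent."
        # Hand the turn over and emit an observation ONLY for the next actor,
        # so its LSTM state advances exactly once per observation it receives.
        self._turn = (self._turn + 1) % len(self.agents)
        next_aid = self.agents[self._turn]
        obs = {next_aid: self.observation_spaces[next_aid].sample()}
        rewards = {acting: 1.0}
        terminateds = {"__all__": self._t >= 20}
        truncateds = {"__all__": False}
        return obs, rewards, terminateds, truncateds, {}
```

Keeping the observation dict restricted to the acting agents is what lets the connector pipeline advance each agent's recurrent state in lockstep with the observations it actually consumes.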

To avoid the assertion errors, you may need custom logic that keeps LSTM state and observations aligned for each agent, for example by padding or masking skipped steps, or by ensuring that the pipeline never emits a duplicate observation for the same agent at the same timestep. There is no clear, out-of-the-box solution in the current RLlib release, and the problem is not directly addressed in the latest multi-agent or connector documentation. Would you like a more detailed breakdown or code-level suggestions?
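One defensive pattern for the "no duplicate observations" part is a small bookkeeping guard inside the environment. The sketch below is a hypothetical helper (the class and method names are inventions for illustration, not RLlib API); it assumes the mismatch originates in the env re-emitting an observation for an agent that has not yet acted on its previous one.

```python
class ObservationEmissionGuard:
    """Hypothetical helper: tracks agents that have received an observation
    but not yet returned an action, so the env never emits a second
    observation for them (which would desync their recurrent state)."""

    def __init__(self):
        self._awaiting_action = set()

    def on_actions_received(self, action_dict):
        # Each incoming action resolves one outstanding observation.
        for agent_id in action_dict:
            self._awaiting_action.discard(agent_id)

    def filter_obs(self, obs_dict):
        # Raise (or alternatively drop) when an agent already has a pending
        # observation, instead of letting the mismatch surface later as a
        # connector/LSTM assertion error.
        clean = {}
        for agent_id, ob in obs_dict.items():
            if agent_id in self._awaiting_action:
                raise AssertionError(
                    f"{agent_id} already has a pending observation; emitting "
                    "another would misalign its recurrent state."
                )
            clean[agent_id] = ob
            self._awaiting_action.add(agent_id)
        return clean
```

Usage would be to call `guard.on_actions_received(action_dict)` at the top of `step()` and pass the outgoing dict through `guard.filter_obs(...)` before returning, so any duplicate emission fails loudly at its source rather than deep inside the connector pipeline.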
