Using an LSTM model for multi-agent policy gradients with different recurrent hidden states per agent

I have a custom multi-agent environment with multiple groups of agents, where I would like all agents in a group to use the same policy. I configured this by setting AlgorithmConfig.multi_agent(policies={...}, policy_mapping_fn=...). Having the policy remember a history of past states and actions would be very useful, since the optimal action for a given state in my environment depends on prior states and actions. Naturally, this leads to using a recurrent layer (use_lstm in the model config dictionary).
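A minimal sketch of that setup (PPO here; the env name, group prefixes, and policy IDs are placeholders, not from a real project):

```python
from ray.rllib.algorithms.ppo import PPOConfig

def policy_mapping_fn(agent_id, *args, **kwargs):
    # Agents whose ID starts with "group_a_" share one policy, the rest share the other.
    return "policy_a" if agent_id.startswith("group_a_") else "policy_b"

config = (
    PPOConfig()
    .environment(env="my_custom_multi_agent_env")   # placeholder env name
    .multi_agent(
        policies={"policy_a", "policy_b"},           # one shared policy per group
        policy_mapping_fn=policy_mapping_fn,
    )
    .training(
        model={
            "use_lstm": True,        # wrap the default model with an LSTM layer
            "lstm_cell_size": 64,
            "max_seq_len": 20,
        }
    )
)
```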

However, each agent within a group observes a different environment state and takes its own actions. Although I would like parameter sharing across the agents in a group, I don't want every agent to share the same LSTM hidden/cell states, since each agent acts independently. How can I do this?

I am looking through the rllib source code but it is very convoluted :sweat_smile:
My goal, basically, is for the actor/policy network and the critic/value network to keep separate LSTM hidden/cell states for each agent while still sharing parameters during training.
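For concreteness, here is a sketch of the behavior I want (the desired semantics, not RLlib internals). It continues the placeholder config above and feeds dummy observations instead of stepping a real env: every agent in a group queries the same policy object (shared weights), but each one carries its own recurrent state between calls.

```python
algo = config.build()                         # assumes the placeholder env is registered
policy = algo.get_policy("policy_a")          # one policy object per group -> shared weights

# One independent recurrent (hidden/cell) state per agent.
agent_states = {
    agent_id: policy.get_initial_state()        # fresh LSTM state for each agent
    for agent_id in ["group_a_0", "group_a_1"]  # hypothetical agent IDs
}

# Dummy observations just to show the state handling; a real loop would step the env.
observations = {aid: policy.observation_space.sample() for aid in agent_states}

for agent_id, obs in observations.items():
    action, state_out, _ = algo.compute_single_action(
        obs, state=agent_states[agent_id], policy_id="policy_a"
    )
    agent_states[agent_id] = state_out        # carry this agent's state forward separately
```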