In simple words, I have five shared-parameter agents and I want to concat a team-wise feature, which is maxpooled on all agents’ FC outputs, to each individual FC output as a means to share information across the team.
My environment is a MultiAgentEnv, meaning that I do have access to all agents’ observations during each step call. Therefore, I could perhaps share observations across the team by adding all agents’ observations to each agent’s observation dict. However, this would be terribly inefficient, because I would need to repeatedly run model inference on the same shared observations in every agent’s forward call.
Another way is perhaps centralized execution. But I wish to do it in a decentralized fashion because MultiAgentBatch is very handy to use; I like how it handles early-exiting agents automatically.
Any idea on how to better do this? Thanks in advance.
Have you tried creating a custom model?
Since all the agents share the same policy, you will get the obs from all agents in a single SampleBatch.
You can then run them through a first model, split the output by the input’s env_id column, average the output per env, concatenate each per-env average to the right agents’ rows, and run everything through a second model.
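A rough sketch of what that two-stage model could look like, in PyTorch. The function name, the `env_ids` column, and the mean pooling (the original question asked for maxpool, which is a one-line swap) are all illustrative, not RLlib API:

```python
import torch

def two_stage_forward(encoder, head, obs, env_ids):
    """Sketch: encode each agent's obs, pool per env, concat, decode.

    obs:     (B, obs_dim) rows from several envs' agents, stacked.
    env_ids: (B,) integer id saying which env each row came from.
    """
    feats = encoder(obs)                      # (B, F) per-agent features
    pooled = torch.empty_like(feats)
    for env in env_ids.unique():
        mask = env_ids == env
        # average (or max-pool) the features of all agents in this env
        pooled[mask] = feats[mask].mean(dim=0)
    # concat each agent's own feature with its team-level feature
    return head(torch.cat([feats, pooled], dim=-1))
```

With identity encoder/head this just appends the per-env average to every row, which makes the grouping easy to check.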
hi @gjoliver, thanks for your help but it is still not very clear to me. Though my env is a MultiAgentEnv, and each agent uses the same key in the policy mapping dict. But each agent would only have access to individual observations during the forward call, otherwise you would need a centralized model (or super-agent that takes all observations and outputs all actions).
super agent is another way to do this.
what I meant was actually creating a custom model, and handling the data in the batch dimension yourself.
normally, a model would simply run the whole batch of data through the NN.
in your case, you will split the batch by episodes and combine the data based on obs from the same episode, before running the modified batch through the NN.
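For concreteness, the split-by-episodes step could be done with plain NumPy on the batch’s episode-id column (`eps_id` matches RLlib’s SampleBatch column name; the helper itself is illustrative):

```python
import numpy as np

def split_by_episode(eps_ids):
    """Map each episode id to the batch row indices belonging to it,
    so rows from the same episode can be combined before the NN."""
    return {eid: np.where(eps_ids == eid)[0] for eid in np.unique(eps_ids)}

# e.g. rows of two agents from episode 7 interleaved with episode 9:
groups = split_by_episode(np.array([7, 9, 7, 9]))
```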
it’s still not simple …
Yes, it is not that simple if I stick with the current PPO algorithm.
I did eventually compromise and used the multi-agent solution, which performs repetitive inference but is much easier to write. I stuffed all agents’ observations into every agent’s observation; I just needed to be extra careful in organizing the observations so that they don’t misalign during the forward calls. It’s not an easy process and is hard to describe in a few sentences, since it involves both Dict and Tuple observations.
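The core of that organizing step can be sketched like this; the helper name and the `own_obs` / `team_obs` keys are my own, not the original code. The key point is a single fixed agent ordering:

```python
import numpy as np

def build_team_obs(step_obs):
    """step_obs: {agent_id: np.ndarray} -- raw obs from one env step.

    Give every agent its own obs plus the whole team's obs in one
    fixed, deterministic agent-id order; keeping that order identical
    for every agent is what avoids misalignment across forward calls.
    """
    order = sorted(step_obs)                  # fixed agent ordering
    team = [step_obs[a] for a in order]
    return {
        a: {"own_obs": step_obs[a], "team_obs": team}
        for a in order
    }
```

Every agent then sees the same `team_obs` layout, so the model can maxpool over it consistently (at the cost of each teammate’s obs being encoded once per agent).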
However, the learn throughput is about 4 times lower than without the pooling.
Meanwhile, the super-agent model faces new problems. First, you have to write a custom action distribution, since action masking is involved. Second, you need to write a new loss function in a custom policy to split rewards and advantages per agent. There’s a lot more engineering involved.
I’m beginning to wonder if I should write a custom training_step / execution plan to tackle this problem, but I have never dared to touch that part of the RLlib library. The Execution Plan API has always seemed formidable to me.
Appreciate the hard work. Completely agree that starting with a simpler solution before trying to optimize performance is the right thing to do.
I don’t know if algorithm / training_step is the right thing to customize here. When I was reading your question for the first time, I thought we may need a custom sampler here, that handles this 2-step inference for episode rollout.
Sampler though is also a beast by itself.
We are currently working on the next generation policy/model APIs that will supposedly allow you to customize everything easily. Maybe this will be easier to implement then.
Thanks again @gjoliver . I was confused at first because I didn’t know how the forward call works in a shared-policy multi-agent environment. I thought each agent separately calls the forward function, but instead forward is called once and all agents’ observations are concatenated along the first dimension to form a batch.
So the multi-agent batch might be the way to go; there isn’t a need to build a super-agent model.
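To illustrate that realization: since one forward call already receives all agents’ observations stacked along dim 0, the team-wise maxpool from the original question can be computed right inside the model. A minimal sketch, assuming all rows belong to one env step (with multiple envs you would first group rows as discussed above):

```python
import torch

def shared_forward(fc, obs):
    """One forward call covering all agents of one env step.

    obs: (n_agents, obs_dim) -- every agent's obs stacked on dim 0.
    Returns (n_agents, 2*F): each agent's FC output concatenated with
    the team feature max-pooled over all agents' FC outputs.
    """
    feats = fc(obs)                               # (n_agents, F)
    team = feats.max(dim=0, keepdim=True).values  # (1, F) team maxpool
    return torch.cat([feats, team.expand_as(feats)], dim=-1)
```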