Hello everyone,
I have been using RLlib for a while. I have a multi-agent setup (a multi-unit auction), and what I am looking for is the resulting Nash equilibrium of the game.
I realized that in RLlib the main optimization target is "episode_reward_mean", which is technically the aggregated reward of all the agents. That is very different from the definition of a Nash equilibrium, where each agent tries to optimize its own reward without caring about the rewards of the others.
How can I change the code so that each agent optimizes its reward individually? In other words, instead of one target variable, "episode_reward_mean", each agent would have its own target variable. (Sorry if the question is vague; it is a little confusing for me as well.)
Hi @amohazab,
episode_reward_mean is just a summary statistic in RLlib. Each agent is optimized using its own individual returns.
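For concreteness, here is a minimal sketch of the kind of setup that achieves this: map each agent to its own policy, so each policy's updates are computed only from that agent's experience. The env name "MyAuctionEnv" and the agent IDs are placeholders, and the config-builder API below is from Ray 2.x, so adapt it to your version:

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment(env="MyAuctionEnv")  # your registered MultiAgentEnv
    .multi_agent(
        # One policy per agent; obs/action spaces are inferred from the env.
        policies={"policy_agent_0", "policy_agent_1"},
        # Route each agent's experience to its own policy.
        policy_mapping_fn=lambda agent_id, *args, **kwargs: f"policy_{agent_id}",
    )
)

algo = config.build()
result = algo.train()
```

With this mapping, "episode_reward_mean" still sums over all agents before averaging, but each policy's gradient step only ever sees its own agent's returns.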
Hi @mannyv,
Thanks for your response.
I tried to write my script so that each agent has its own policy, policy mapping function, observation space, etc., so the target variable of each agent is ONLY its own reward, independent of the rewards of the other agents. Is that right?
But if this is the case, why do we always observe a steady increase in "episode_reward_mean" during training? Isn't it implemented so that it yields the highest episode_reward_mean at the end of training?
I might be wrong, but the highest "episode_reward_mean" corresponds to the socially optimal solution, which is very different from the Nash equilibrium of the game (e.g., in a prisoner's dilemma, mutual cooperation maximizes the sum of rewards, but mutual defection is the Nash equilibrium).
Thanks again
Here is where the summary statistics are generated.
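The result dict also carries per-policy statistics, so you can track each agent's own return instead of the aggregate. A hedged sketch: the key names below ("episode_reward_mean", "policy_reward_mean") come from the 1.x/2.x result dicts for multi-agent runs and may differ in other versions:

```python
result = algo.train()

# Aggregate: sum of all agents' rewards per episode, averaged over episodes.
print("aggregate:", result["episode_reward_mean"])

# Per-policy: each policy's own average episode return, which is the
# quantity each agent is actually optimizing.
for policy_id, mean_return in result["policy_reward_mean"].items():
    print(policy_id, mean_return)
```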