Any metric other than "episode_reward_mean"?

Hello everyone,
I have been using RLlib for a while. I have a multi-agent setup (a multi-unit auction), and what I am looking for is the resulting Nash equilibrium of the game.
I realized that in RLlib the main optimization target is “episode_reward_mean”, which is technically the aggregated reward of all the agents. However, that is very different from the definition of a Nash equilibrium, where each agent tries to optimize its own reward without caring about the rewards of the others.
How can I change the code so that each agent tries to optimize its reward individually? In other words, instead of a single target variable, “episode_reward_mean”, each agent would have its own target variable. (Sorry if the question is vague; it is a little bit confusing for me as well :slight_smile: )

Hi @amohazab,

episode_reward_mean is just a summary statistic in RLlib. Each agent is optimized using its own individual returns.
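
As a minimal sketch of what that means in a config (using the bundled MultiAgentCartPole example env and the old-style config dict rather than your auction env; the agent ids and policy names here are assumptions):

```python
# Two agents mapped to two independent policies (env is just illustrative).
from ray import tune
from ray.rllib.examples.env.multi_agent import MultiAgentCartPole

config = {
    "env": MultiAgentCartPole,
    "env_config": {"num_agents": 2},
    "multiagent": {
        # Declaring the policies as a set of ids lets RLlib infer each
        # policy's observation/action spaces from the env.
        "policies": {"policy_0", "policy_1"},
        # MultiAgentCartPole uses integer agent ids 0..n-1; adapt the
        # mapping to whatever ids your auction env emits.
        "policy_mapping_fn": lambda agent_id, *args, **kwargs: f"policy_{agent_id}",
    },
}

# Each policy is updated only from the experiences (and rewards) of the
# agent mapped to it -- there is no shared "sum of rewards" objective.
tune.run("PPO", config=config, stop={"training_iteration": 10})
```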

Hi @mannyv
Thanks for your response.

I tried to write my script so that each agent has its own policy, policy mapping function, observation space, etc. So the target variable of each agent is ONLY its own reward, independent of the rewards of the other agents, right?

But if this is the case, why do we always observe a constant increase in “episode_reward_mean” during the simulation? Isn’t it implemented in a way that gives us the highest episode_reward_mean at the end of the simulation?

I might be wrong, but the highest “episode_reward_mean” corresponds to the socially optimal solution, which is very different from the Nash equilibrium of the game.
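
(As a toy illustration of the difference, not my auction setup: in a Prisoner’s Dilemma the joint action with the highest summed reward is mutual cooperation, but the only Nash equilibrium is mutual defection, so maximizing the sum would point at the wrong outcome.)

```python
# Prisoner's Dilemma payoffs (row player, column player); 0 = cooperate, 1 = defect.
import numpy as np

R = np.array([[(3, 3), (0, 5)],
              [(5, 0), (1, 1)]])

# Social optimum: the joint action maximizing the SUMMED reward -> (0, 0).
totals = R.sum(axis=-1)
print("social optimum:", np.unravel_index(totals.argmax(), totals.shape))

# Nash equilibrium: a cell where neither player gains by deviating
# unilaterally -> only (1, 1), i.e. mutual defection.
for a in range(2):
    for b in range(2):
        if R[a, b, 0] >= R[:, b, 0].max() and R[a, b, 1] >= R[a, :, 1].max():
            print("nash equilibrium:", (a, b))
```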
Thanks again :slight_smile:

@amohazab,

Here is where the summary statistics are generated.
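
Roughly, what happens there is the following (a simplified sketch, not the actual RLlib source): each finished episode carries the per-agent returns, the episode’s scalar reward is their sum, and the reported metric is just the mean of those sums over recent episodes.

```python
# Simplified sketch of the summary step (not RLlib's real implementation).
import numpy as np

def summarize(episodes):
    """episodes: list of {agent_id: return} dicts, one per finished episode."""
    # The episode's scalar reward is the SUM over all agents' returns ...
    totals = [sum(agent_returns.values()) for agent_returns in episodes]
    # ... and episode_reward_mean is the mean of those sums. It is reported
    # for monitoring only; it is not a loss that any policy is trained on.
    return {"episode_reward_mean": float(np.mean(totals))}

print(summarize([{"agent_0": 1.0, "agent_1": 2.0},
                 {"agent_0": 0.5, "agent_1": 1.5}]))  # -> 2.5
```

If every agent’s individual return improves, their sum improves too, so the aggregate curve trends upward even though nothing in the optimization targets the sum directly.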
