How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hello, I would like to understand how RLlib computes the policy-gradient update in the PPO + MARL case.
I am doing parameter sharing and have the following setting:
- multi-agent competitive setting (as opposed to cooperative).
- Agents = {A, B}
- Actions for each agent: choose to cooperate or choose to act alone (compete).
- Reward structure:
  - If A acts alone: reward = 35
  - If B acts alone: reward = 55
  - If A and B cooperate: rewards = (50, 50), i.e., each agent gets 50.
Expected result: the agents should learn never to cooperate, since B gets a higher reward by acting alone.
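For context, here is a minimal sketch of what I mean by this setup (simplified placeholder code, not my actual environment; the exact `MultiAgentEnv` API details vary across RLlib versions, and the payoff for a lone cooperator is not spelled out above, so it is a placeholder too):

```python
import gymnasium as gym
from ray.rllib.env.multi_agent_env import MultiAgentEnv
from ray.rllib.algorithms.ppo import PPOConfig

COOPERATE, ACT_ALONE = 0, 1

class CoopCompeteEnv(MultiAgentEnv):
    """One-step matrix game with the payoff structure described above."""

    def __init__(self, config=None):
        super().__init__()
        self._agent_ids = {"A", "B"}
        self.observation_space = gym.spaces.Discrete(1)
        self.action_space = gym.spaces.Discrete(2)

    def reset(self, *, seed=None, options=None):
        return {"A": 0, "B": 0}, {}

    def step(self, action_dict):
        if action_dict["A"] == COOPERATE and action_dict["B"] == COOPERATE:
            rewards = {"A": 50.0, "B": 50.0}  # joint cooperation -> (50, 50)
        else:
            # Acting alone: A gets 35, B gets 55. A lone cooperator's payoff
            # is not specified above, so 0.0 here is just a placeholder.
            rewards = {
                "A": 35.0 if action_dict["A"] == ACT_ALONE else 0.0,
                "B": 55.0 if action_dict["B"] == ACT_ALONE else 0.0,
            }
        obs = {"A": 0, "B": 0}
        terminateds = {"A": True, "B": True, "__all__": True}
        truncateds = {"A": False, "B": False, "__all__": False}
        return obs, rewards, terminateds, truncateds, {}

config = (
    PPOConfig()
    .environment(CoopCompeteEnv)
    .multi_agent(
        # Single policy ID -> both agents share the same network parameters.
        policies={"shared"},
        policy_mapping_fn=lambda agent_id, *args, **kwargs: "shared",
    )
)
```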
QUESTION: how does RLlib do the gradient ascent here?
OPTION 1: RLlib updates the policy based on the SUM of the agents' rewards. In that case, the reward RLlib will use is the following:
If the agents cooperate, the sum of their rewards is 50 + 50 = 100.
==> This is the best overall sum of rewards, but it is suboptimal for agent B. If the policy is trained to maximize the SUM of all agents' rewards, it will dictate cooperation for both A and B.
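To make OPTION 1 concrete, this is the kind of objective I mean (plain NumPy illustration, not RLlib code):

```python
import numpy as np

# OPTION 1: both agents' rewards are first collapsed into one team signal, and
# the shared policy is pushed toward whatever maximizes that sum.
cooperate = {"A": 50.0, "B": 50.0}
act_alone = {"A": 35.0, "B": 55.0}

team_signal = {
    "cooperate": np.sum(list(cooperate.values())),  # 100.0
    "act_alone": np.sum(list(act_alone.values())),  # 90.0
}
# A policy trained only on this team signal would prefer cooperation for both
# agents (100 > 90), even though B individually does better acting alone.
```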
OPTION 2: RLlib maximizes each agent's own return, regardless of the sum of rewards. Since this is a MARL problem, even with parameter sharing, I would expect each agent to maximize its own return.
If the agents act alone, the sum of their rewards is 35 + 55 = 90.
==> This is the action that maximizes B's individual reward, so for B the policy will dictate not to cooperate, even though 90 < 100.
So Agent B should never learn to cooperate.
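To show what I would expect OPTION 2 to mean in practice, here is a rough conceptual sketch (plain NumPy, not RLlib internals; `compute_gae` is my own placeholder): each agent's advantages are computed from that agent's own rewards only, and the shared policy is then updated on the concatenation of the per-agent batches.

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Standard GAE over a single agent's own trajectory."""
    advantages = np.zeros_like(rewards, dtype=float)
    last = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        last = delta + gamma * lam * last
        advantages[t] = last
    return advantages

# One "cooperate" episode, one step per agent; the value estimates are placeholders.
batches = {
    "A": {"rewards": np.array([50.0]), "values": np.array([0.0])},
    "B": {"rewards": np.array([50.0]), "values": np.array([0.0])},
}

# OPTION 2: advantages are computed per agent from that agent's rewards only...
per_agent_adv = {
    aid: compute_gae(b["rewards"], b["values"]) for aid, b in batches.items()
}

# ...and the shared policy's PPO loss is averaged over the concatenation of the
# per-agent batches, so B's gradient contribution is weighted by B's own
# advantage, never by the team sum.
all_advantages = np.concatenate(list(per_agent_adv.values()))
```

Is this second picture the one that actually matches RLlib's behavior with a shared policy?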
QUESTION: Is this possible in RLlib with parameter sharing? Is RLlib's PPO MARL implementation maximizing each agent's individual return?