RLlib - QMIX configurations

Hey guys!

I am trying to get a better understanding of the QMIX algorithm.
As a first step, I want to train two agents to solve the following environment:

(This is a modified version of https://github.com/koulanurag/ma-gym 's Switch2-v0 environment)


  • For each time step an agent is not in its final position, it receives a reward of -5
  • For each time step an agent is in its final position, it receives a reward of +5
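Written out as code, the reward scheme above looks roughly like this. It is only a minimal sketch, not the environment's actual code: `step_reward` and `episode_return` are names I made up for illustration, and the default episode length of 100 is the one used in this environment.

```python
STEP_REWARD_AT_GOAL = 5      # reward per time step spent in the final position
STEP_REWARD_ELSEWHERE = -5   # reward per time step spent anywhere else


def step_reward(at_goal):
    """Per-agent reward for a single time step."""
    return STEP_REWARD_AT_GOAL if at_goal else STEP_REWARD_ELSEWHERE


def episode_return(steps_at_goal, episode_len=100):
    """Return summed over all agents for one episode.

    steps_at_goal: one entry per agent, the number of time steps that
    agent spent in its final position during the episode.
    """
    total = 0
    for n in steps_at_goal:
        total += n * STEP_REWARD_AT_GOAL
        total += (episode_len - n) * STEP_REWARD_ELSEWHERE
    return total
```

With two agents and 100 steps, the summed return lies between -1000 and +1000. The upper end should not be reachable in practice, assuming both agents first have to walk to their final positions.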


Each agent can choose one of five actions:

  1. “DOWN”
  2. “LEFT”
  3. “UP”
  4. “RIGHT”
  5. “NOOP”

For each agent, an observation consists of:

  1. ID of the agent
  2. Y coordinate of the agent
  3. X coordinate of the agent
  4. the current step count (as a time reference)

The additional state is the concatenation of both agents' observations.

Here is an example observation:
{'group_0': [{'obs': [0.0, 0.0, 0.17, 0.0], 'state': [0.0, 0.0, 0.17, 0.0, 1.0, 0.0, 0.83, 0.0]}],
 'group_1': [{'obs': [1.0, 0.0, 0.83, 0.0], 'state': [0.0, 0.0, 0.17, 0.0, 1.0, 0.0, 0.83, 0.0]}]}
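To show how such an observation is put together, here is a minimal sketch that rebuilds the example from the two per-agent observations. The helper name `build_observations` is hypothetical, and the sketch assumes the values are already normalized and that each agent sits in its own group.

```python
def build_observations(per_agent_obs):
    """Attach the shared global state to each agent's observation.

    per_agent_obs: list of [agent_id, y, x, step_count] lists, one per
    agent, already normalized.
    """
    # The global state is the concatenation of all agents' observations.
    state = [value for obs in per_agent_obs for value in obs]
    return {
        f"group_{i}": [{"obs": list(obs), "state": list(state)}]
        for i, obs in enumerate(per_agent_obs)
    }


example = build_observations([
    [0.0, 0.0, 0.17, 0.0],  # agent 0
    [1.0, 0.0, 0.83, 0.0],  # agent 1
])
```

Every entry carries the same 8-element state, while the `obs` part differs per agent.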

The goal is for both agents to spend as much time as possible in their final positions.
Each episode ends after 100 timesteps.

The agents should learn to “communicate” and use the corridor one after the other in order to reach their final positions as fast as possible.

In the RLlib documentation I read: "Agent Grouping is required to leverage algorithms such as QMIX"
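As far as I understand the documentation, grouping turns several agents into one logical agent: the group's observation is the list of its members' observations, and the group's reward is the sum of its members' rewards. The following is a simplified illustration in plain Python, not RLlib's actual wrapper; the function name `group_step_outputs` is made up.

```python
def group_step_outputs(groups, obs, rewards):
    """Merge per-agent observations and rewards into per-group values.

    groups:  maps a group ID to the list of agent IDs in that group
    obs:     maps an agent ID to that agent's observation
    rewards: maps an agent ID to that agent's reward
    """
    grouped_obs = {}
    grouped_rewards = {}
    for group_id, agent_ids in groups.items():
        # Observations are collected in the order given by the grouping.
        grouped_obs[group_id] = [obs[a] for a in agent_ids]
        # The group receives a single reward: the sum over its members.
        grouped_rewards[group_id] = sum(rewards[a] for a in agent_ids)
    return grouped_obs, grouped_rewards
```

For example, with one agent in its final position (+5) and the other not (-5), a single group containing both agents would see a group reward of 0 for that step.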

In my first trial, I put both agents in one group:

grouping = {
    "group_0": [0, 1]
}

The mean episode reward went up, but the agents never reached a mean episode reward higher than about 500.
I did a rollout and saw that both agents always took the same action at each timestep. That means at most one of them could be in its final position at any given timestep.
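To put a number on this instead of eyeballing the rollout, a small helper like the one below could be used on the recorded actions. It is only a sketch: `same_action_fraction` is a name I made up, and it assumes the rollout has been stored as one tuple of action indices per timestep.

```python
def same_action_fraction(joint_actions):
    """Fraction of timesteps in which all agents picked the same action.

    joint_actions: list with one tuple per timestep, each tuple holding
    the action index of every agent at that step.
    """
    if not joint_actions:
        return 0.0
    identical = sum(1 for actions in joint_actions if len(set(actions)) == 1)
    return identical / len(joint_actions)
```

A value of 1.0 corresponds to what I observed: identical actions in every single timestep.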

In the second trial I put them in different groups:

grouping = {
    "group_0": [0],
    "group_1": [1]
}

but had the same result (both agents take the same action in each timestep). In this case I am not sure whether both groups / “logical agents” are still trained by the same mixing network of the QMIX algorithm. Do you know if that is the case?

So far I didn’t specify the multiagent parameter in the config dict, so that was my next idea.
I added the following lines to the config:

"multiagent": {
    "policies": {
        # the first tuple value is None -> uses default policy
        "pol_0": (None, obs_space, act_space, {"agent_id": "group_0"}),
        "pol_1": (None, obs_space, act_space, {"agent_id": "group_1"}),
    },
    "policy_mapping_fn": lambda agent_id: "pol_0" if agent_id == "group_0" else "pol_1",
},

In this case, the agents can choose different actions, but so far they have not learned to avoid entering the corridor at the same time.
That's why I am not sure whether they are still trained by the same mixing network in this case.
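For reference, the mapping from my config can also be written as a named function, which makes it easier to check in isolation. This is just a sketch of the same logic; accepting and ignoring extra arguments is my own addition, on the assumption that some RLlib versions pass more than the agent ID.

```python
def policy_mapping_fn(agent_id, *args, **kwargs):
    """Map a group ID to the name of the policy that controls it."""
    # With grouping, the IDs seen here are the group IDs ("group_0",
    # "group_1"), not the raw agent IDs 0 and 1.
    # Extra positional/keyword arguments are accepted and ignored.
    return "pol_0" if agent_id == "group_0" else "pol_1"
```

Note that any ID other than "group_0" ends up on "pol_1", so each of the two groups gets its own policy.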

Now I am not sure whether QMIX is the right algorithm for this kind of problem, although it is a cooperative MARL problem.
I would be more than happy if you could give me any input. Perhaps I overlooked something or didn’t understand QMIX correctly.

I pushed the code to this GitHub repo, in case you want to take a look at it directly:

Cheers, Korbi. :blush:
