Issue using QMIX with a custom MARL Environment

Hi there,

I am trying to run QMIX on a custom environment (built using the RLlib multi-agent environment format) with code adapted from the Two-Step Game example. I am running into a strange issue where training will not terminate if I specify a large number (~20) of training iterations. If I specify a small number (~5), it terminates as expected.

With a larger number of iterations, status updates are still reported as usual, but training never terminates.

== Status ==
Current time: 2022-01-31 17:25:42 (running for 00:09:16.09)
Memory usage on this node: 11.5/16.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 1.0/8 CPUs, 0/0 GPUs, 0.0/5.03 GiB heap, 0.0/2.52 GiB objects
Result logdir: /Users/alexrutherford/ray_results/QMIX
Number of trials: 1/1 (1 RUNNING)

I’ve included the code I am using below; any help on this would be greatly appreciated!

""" QMIX for camas (name of custom env)

Drawn from Ray RLlib Two-Step example
"""

import argparse
import os

from gym.spaces import Tuple

import ray
from ray import tune
from ray.tune import register_env
from ray.rllib.utils.test_utils import check_learning_achieved

# Import custom env
from camas_gym.envs.camas_multi_agent_env import CamasMultiAgentEnv

parser = argparse.ArgumentParser()
parser.add_argument(
    "--run",
    type=str,
    default="QMIX",
    help="The RLlib-registered algorithm to use.")
parser.add_argument(
    "--framework",
    choices=["tf", "tf2", "tfe", "torch"],
    default="tf",
    help="The DL framework specifier.")
parser.add_argument("--num-cpus", type=int, default=0)
parser.add_argument(
    "--mixer",
    type=str,
    default="qmix",
    choices=["qmix", "vdn", "none"],
    help="The mixer model to use.")
parser.add_argument(
    "--as-test",
    action="store_true",
    help="Whether this script should be run as a test: --stop-reward must "
    "be achieved within --stop-timesteps AND --stop-iters.")
parser.add_argument(
    "--stop-iters",
    type=int,
    default=20,
    help="Number of iterations to train.")
parser.add_argument(
    "--stop-timesteps",
    type=int,
    default=50000,
    help="Number of timesteps to train.")
parser.add_argument(
    "--stop-reward",
    type=float,
    default=150.0,
    help="Reward at which we stop training.")
parser.add_argument(
    "--local-mode",
    action="store_true",
    help="Init Ray in local mode for easier debugging.")

if __name__ == "__main__":
    args = parser.parse_args()

    ray.init(num_cpus=args.num_cpus or None, local_mode=args.local_mode)

    # Place each agent in its own single-agent group.
    grouping = {
        "group_0": ["agent_0"],
        "group_1": ["agent_1"],
        "group_2": ["agent_2"],
    }

    # Grouped spaces: a Tuple with one entry per agent in the group.
    obs_space = Tuple([
        CamasMultiAgentEnv.observation_space,
    ])
    act_space = Tuple([
        CamasMultiAgentEnv.action_space,
    ])
    
    register_env(
        "camas_multi",
        lambda config: CamasMultiAgentEnv(config).with_agent_groups(
            grouping, obs_space=obs_space, act_space=act_space))

    config = {
        "rollout_fragment_length": 4,
        "train_batch_size": 32,
        "exploration_config": {
            "final_epsilon": 0.1,
        },
        "num_workers": 0,
        "mixer": args.mixer,
        "env_config": {
        },
        # Use GPUs iff `RLLIB_NUM_GPUS` env var set to > 0.
        "num_gpus": int(os.environ.get("RLLIB_NUM_GPUS", "0")),
    }

    # Tune stops the trial as soon as any one of these criteria is met.
    stop = {
        "episode_reward_mean": args.stop_reward,
        "timesteps_total": args.stop_timesteps,
        "training_iteration": args.stop_iters,
    }

    config = dict(config, **{
        "env": "camas_multi",
    })

    results = tune.run(args.run, stop=stop, config=config, verbose=2)

    if args.as_test:
        check_learning_achieved(results, args.stop_reward)

    ray.shutdown()

May I ask how you defined your CamasMultiAgentEnv.observation_space? (Discrete, Dict, …)

Yep absolutely, it is defined as:

obs_space: Tuple(Tuple(Discrete(38), Discrete(38), Discrete(38))) 
act_space: Tuple(Discrete(4))

For context: the environment is a topological map, and the agents are tasked with navigating from a starting node to a goal node. There are 38 possible locations for the agents, including nodes and edges. At each node, 4 actions are possible (some may be invalid, but for simplicity the action space remains constant and invalid actions simply return an immediate negative reward). As such, the obs_space passed to each agent represents the joint state of the environment.
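Roughly speaking (a sketch only, with illustrative names and constants rather than the actual env code), spaces like that could be built as:

from gym.spaces import Discrete, Tuple

NUM_LOCATIONS = 38  # nodes + edges on the topological map
NUM_ACTIONS = 4     # fixed per-node action set; invalid moves just return a negative reward
NUM_AGENTS = 3

class CamasSpacesSketch:
    # Every agent observes the joint state: one Discrete(38) entry per agent.
    observation_space = Tuple([Discrete(NUM_LOCATIONS) for _ in range(NUM_AGENTS)])
    # Each agent picks one of the 4 movement actions.
    action_space = Discrete(NUM_ACTIONS)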

Hey @am-rutherford, could this be an issue with your env performing some very expensive computations as training continues?
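One quick way to check that (a rough sketch, assuming the env constructor accepts an empty config dict and follows the usual MultiAgentEnv reset/step API) is to time raw env steps outside of RLlib:

import time

from camas_gym.envs.camas_multi_agent_env import CamasMultiAgentEnv

env = CamasMultiAgentEnv({})
obs = env.reset()
for i in range(10000):
    # Sample a random action for every agent that just received an observation.
    actions = {agent_id: env.action_space.sample() for agent_id in obs}
    start = time.perf_counter()
    obs, rewards, dones, infos = env.step(actions)
    elapsed = time.perf_counter() - start
    if elapsed > 0.1:  # flag unusually slow steps
        print(f"step {i} took {elapsed:.3f}s")
    if dones.get("__all__"):
        obs = env.reset()

If the step time grows over the course of the run, that would point at the env rather than RLlib.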

Do you see the same problem when you switch out the env for, say, multi-agent CartPole or some other dummy env?
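For reference, the grouped env registration from the Two-Step Game example (module paths as in recent 1.x releases of Ray; check against your installed version) looks roughly like this and could be dropped in place of camas_multi to rule out the custom env:

from gym.spaces import Dict, MultiDiscrete, Tuple

from ray.tune import register_env
from ray.rllib.env.multi_agent_env import ENV_STATE
from ray.rllib.examples.env.two_step_game import TwoStepGame

# Both agents share one group, as in the original example.
grouping = {"group_1": [0, 1]}
obs_space = Tuple([
    Dict({"obs": MultiDiscrete([2, 2, 2, 3]), ENV_STATE: MultiDiscrete([2, 2, 2])}),
    Dict({"obs": MultiDiscrete([2, 2, 2, 3]), ENV_STATE: MultiDiscrete([2, 2, 2])}),
])
act_space = Tuple([TwoStepGame.action_space, TwoStepGame.action_space])

register_env(
    "grouped_twostep",
    lambda config: TwoStepGame(config).with_agent_groups(
        grouping, obs_space=obs_space, act_space=act_space))

# Note: the example also sets "separate_state_space": True and
# "one_hot_state_encoding": True in env_config when running QMIX.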