Multi-Agent Training with Different Algorithms

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity

(Severity is both none and high: at the moment I'm just curious, but depending on the answer this might completely block me from doing something I wanted to do with RLlib.)

I have a quick question about the two-algorithm multi-agent example: ray/multi_agent_two_trainers.py at master · ray-project/ray · GitHub

I see that this creates separate DQN and PPO trainers, and then in line 154 it calls result_dqn = dqn.train() and in line 159 it calls result_ppo = ppo.train(). If I understand the Trainer class correctly, each Trainer object creates its own environment(s), and Trainer.train() collects experiences (in addition to learning from them). Is that right?

So is my understanding of the entire example correct that
(1) each trainer actually has its own copy of the environment,
(2) each trainer separately collects a different set of experiences from its own private environment, and only that trainer trains on those experiences,
(3) there are actually duplicate copies of each neural network: in the DQN trainer's env there are DQN agents (which are learning) as well as PPO agents (which are not learning and are only queried for actions to generate experiences for the DQN trainer), and similarly in the PPO trainer's env there are learning PPO agents and non-learning DQN agents, and
(4) that's why in lines 172-173 we need to synchronise weights (roughly sketched below), so that the non-learning agents in each env are reasonably up to date?
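
For reference, I believe the weight-syncing step in that example looks roughly like this (quoting from memory, so the exact policy IDs may differ):

    # Copy the freshly trained DQN weights into the non-learning DQN policy held by
    # the PPO trainer, and vice versa.
    ppo.set_weights(dqn.get_weights(["dqn_policy"]))
    dqn.set_weights(ppo.get_weights(["ppo_policy"]))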


Hi Matthias!

Would you like to ask your question in RLlib Office Hours? Just add your question to this doc: RLlib Office Hours - Google Docs

Thanks! Hope to see you there!

I implemented a multi-trainer setup using the client-server framework that works a little differently from the RLlib example. In my implementation, the trainers don't alternate training: they both train on the same environment. Basically, I have a single client script and two server scripts. My client script has two clients, which connect to the servers on different ports. I generate actions from the respective clients, and that allows me (theoretically) to train with two algorithms on the same simulation instance.

I’m still working out some of the bugs, but I can share some pseudo-code if you’re interested in trying it out yourself.


Hi Edward, that would be super interesting! Are you basically using external environments? Or something else? Any code, even pseudocode, you could share would be much appreciated!

I started with the Cartpole client-server example. I'm doing a multi-agent game where I want one algorithm controlling some agents and another algorithm controlling others. These agents interact in a common game (think "agent" in the agent-based-simulation sense, not agent = RL algorithm).

The client:

    # Import needed on the client side (RLlib's external-env policy client).
    from ray.rllib.env.policy_client import PolicyClient

    # Make two client objects, one for each server. Make sure they use different ports.
    client_1 = PolicyClient("http://localhost:9900", inference_mode="local")
    client_2 = PolicyClient("http://localhost:9910", inference_mode="local")

    # Import your sim. I am using custom MultiAgentEnv simulations that I made myself, but this
    # should work with any RLlib environment that can be used with client-server.
    import sim  # Pseudocode ;)

    for _ in range(num_episodes):
        # Start data generation.
        obs = sim.reset()
        # Each client has to start its own episode.
        eid_1 = client_1.start_episode(training_enabled=True)
        eid_2 = client_2.start_episode(training_enabled=True)
        for _ in range(max_steps_per_episode):
            # Combine the actions from the two servers. Notice that I only give client 1 the
            # observations that are associated with that server, so that it only reports actions
            # for those agents. This is important because everything has to sync up correctly.
            action = {
                **client_1.get_action(eid_1, obs[agents_with_client_1]),
                **client_2.get_action(eid_2, obs[agents_with_client_2]),
            }
            # The actions are passed to the simulation, so all the agents interact in each step.
            obs, reward, done, info = sim.step(action)
            client_1.log_returns(eid_1, reward[agents_with_client_1])
            client_2.log_returns(eid_2, reward[agents_with_client_2])
            if done["__all__"]:
                break
        client_1.end_episode(eid_1, obs[agents_with_client_1])
        client_2.end_episode(eid_2, obs[agents_with_client_2])
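
A note on the obs[agents_with_client_1] bits above: that's pseudocode for "the subset of the multi-agent dict belonging to client 1's agents". One way to write it, assuming agents_with_client_1 is a set of agent IDs, would be a small helper like this:

    def subdict(d, agent_ids):
        # Keep only the entries belonging to the given agents.
        return {aid: value for aid, value in d.items() if aid in agent_ids}

    # e.g.: client_1.get_action(eid_1, subdict(obs, agents_with_client_1))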

Server 1:

    # Server 1 will train with PPO.
    from ray.rllib.agents import ppo
    from ray.tune.logger import pretty_print  # used below for printing results

    def _input(ioctx):
        if ioctx.worker_index > 0 or ioctx.worker.num_workers == 0:
            return PolicyServerInputMA(
                ioctx,
                "localhost",
                9900,
                idle_timeout=3.0
            )
        # No InputReader (PolicyServerInput) needed.
        else:
            return None

    # Here I have this algorithm training multiple policies
    policies = {
        'policy_1': (None, observation_space, action_space, {}),
        'policy_2': (None, observation_space, action_space, {})
    }

    def policy_mapping_fn(agent_id):
        ... # This is straightforward, just map agent ids to policy id as you would do for any multi-agent game

    config={
        # Use the connector server to generate experiences.
        "input": _input,
        "env": None,
        "multiagent": {
            "policies": policies,
            "policy_mapping_fn": policy_mapping_fn,
            "count_steps_by": "agent_steps"
        },
        "num_workers": 0,
    }

    trainer = ppo.PPOTrainer(config=config)

    # Serving and training loop.
    ts = 0
    for _ in range(args.stop_iters):
        results = trainer.train()
        print(pretty_print(results))
        if results["episode_reward_mean"] >= args.stop_reward or ts >= args.stop_timesteps:
            break
        ts += results["timesteps_total"]
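
(The args.* values in the loop come from a small argparse block like the one in RLlib's cartpole server example; something along these lines, where the exact flag names and defaults are my own placeholders. The same block is used in Server 2 below.)

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--stop-iters", type=int, default=200)
    parser.add_argument("--stop-timesteps", type=int, default=500000)
    parser.add_argument("--stop-reward", type=float, default=80.0)
    args = parser.parse_args()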

Server 2:

    # Server 2 will train with A2C. (In this RLlib version the A2C trainer lives under
    # the a3c package.)
    from ray.rllib.agents.a3c import a2c
    from ray.tune.logger import pretty_print  # used below for printing results

    def _input(ioctx):
        if ioctx.worker_index > 0 or ioctx.worker.num_workers == 0:
            return PolicyServerInputMA(
                ioctx,
                "localhost",
                9910,  # Server 2 must listen on the second port, the one client_2 connects to.
                idle_timeout=3.0
            )
        # No InputReader (PolicyServerInput) needed.
        else:
            return None

    # This algorithm just trains a single policy, but I am gonna use the multi-agent setup so that the
    # expected inputs and outputs follow the format (dict of agent ids as keys and obs/actions as
    #  values)
    policies = {
        'policy_1': (None, observation_space, action_space, {}),
    }

    def policy_mapping_fn(agent_id):
        return "policy_1"

    config={
        # Use the connector server to generate experiences.
        "input": _input,
        "env": None,
        "multiagent": {
            "policies": policies,
            "policy_mapping_fn": policy_mapping_fn,
            "count_steps_by": "agent_steps"
        },
        "num_workers": 0,
    }

    trainer = a2c.A2CTrainer(config=config)

    # Serving and training loop.
    ts = 0
    for _ in range(args.stop_iters):
        results = trainer.train()
        print(pretty_print(results))
        if results["episode_reward_mean"] >= args.stop_reward or ts >= args.stop_timesteps:
            break
        ts += results["timesteps_total"]

Then just run the client and server scripts on the same machine via different processes.

Like I said, I’m still debugging it, but hopefully it works. Please let me know if you have questions or if you get it to work for you!

@rusu24edward

I may have misunderstood, but if the agents are always using a fixed algorithm, you could specify it similar to this and not need multiple servers or clients:

config["multiagent"]["policies"] = 
    {"a": ("A2C", ...), "b": ("PPO", ...)} 

Maybe I’m misunderstanding now, but don’t you have to pass a Trainer object to tune.run()? I.e. either an A2CTrainer or a PPOTrainer? What would you pass to tune in your example?

@mgerstgrasser,

Yes, you have to pass either a Trainer or the string it has been registered under.

Normally the first tuple value in the multiagent policies dictionary is None, in which case it uses the one you pass in as the default. If you provide a Trainer instead of None, then that will override the default you passed to tune.
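
For reference, each entry in that policies dictionary is a 4-tuple roughly along these lines (the spaces and policy ID are placeholders here):

    config["multiagent"]["policies"] = {
        # (None_or_explicit_class, observation_space, action_space, per_policy_config_overrides)
        "my_policy": (None, obs_space, act_space, {}),
    }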

I did not know you could put them into tune that way. I’ll give it a try. Just to confirm, this will allow some entities in the game to be controlled by one algorithm and other entities to be controlled by another, and for that to happen within the same environment?

I also found this example, which is quite a bit more complex but claims to accomplish what’s desired here: ray/two_trainer_workflow.py at master · ray-project/ray · GitHub

Yes that is correct.

Just to triple check, this works for learning, not just deploying already-trained policies? If yes, that’s fantastic - much simpler than the example I linked in my original post. Is this documented anywhere?

Yes that is correct. Here is an example that uses it as I described. Here it only trains one of them, but it could have trained both. Try setting both of them to PG.

That’s awesome! Glad that this functionality already exists in RLlib

Okay, I've tried this just now, and it does look like this is working. I had to specify a policy class, not a string, and also not a Trainer, so something like this:

config["multiagent"]["policies"] = 
    {"a": (a2c.A2CTFPolicy, ...), "b": (ppo.PPOTFPolicy, ...)} 

Does it then matter which Trainer I pass to tune? A2C or PPO?

Just to follow up after bringing this up in office hours: yes, it does matter which Trainer you pass to Tune, and you cannot use all combinations of algorithms this way. While the policy class ensures that the respective algorithm's loss and update logic (its learn_on_batch()) is used, the Trainer determines how batches are put together. This can be a problem: for instance, a DQN Trainer will maintain and sample from a replay buffer, which is not the right thing to do for non-DQN algorithms, and vice versa. So passing in different policy classes can work for policies within the same algorithm "family", but won't work (correctly) for completely different algorithms that require batches to be assembled in different ways.


Thanks for following up on this, @mgerstgrasser! I think this speaks to the need to disambiguate the use of Trainer in the framework. I believe that now the correct term is algorithm, but even that doesn’t capture the separation between the data batching and the algorithm being used to train (some) policies.

@sven1977 Quick follow-up question to office hours: Does it also follow that even with the same algorithm, you can’t have e.g. different batch sizes for different agents?

Hey @mgerstgrasser, sorry for the long silence. I have been quite busy with the Summit lately.

We can gladly discuss this topic in today’s office hour.

Per design, an Algorithm (formerly known as a "Trainer", e.g. our "PPO" or "A3C" Algorithm sub-classes) should determine what happens when: e.g. sample collection, training updates, weight syncing. The Policies, on the other hand (e.g. PPOTfPolicy or A3CTorchPolicy), should determine how these things happen: e.g. compute a loss from a given batch, partially synchronize weights from a main net to a target net using some tau parameter and PyTorch, etc.

As @mannyv suggested, you can use different Policies (like PPOPolicy and A3CPolicy) within the same Algorithm (like "PPO"). This works only if the policies are compatible with the Algorithm, which they are not always. For example, it's probably not a good idea to train a PPOPolicy inside a "DQN" Algorithm due to the off-policy nature of DQN (it would send outdated replay-buffer samples to the PPO loss, which wouldn't learn properly).

Also, to answer your config (batch size) question: yes, when setting up multi-agent policies (see @mannyv's post above), you can give a dict like this:

    from ray.rllib.algorithms.ppo import PPOConfig
    from ray.rllib.policy.policy import PolicySpec

    my_config = PPOConfig()
    my_config.multi_agent(
        policies={
            # Per-policy config overrides, e.g. a larger train batch size for pol1 only.
            "pol1": PolicySpec(config={"train_batch_size": 10000}),
            # Per-policy spaces, e.g. some other action/observation space for pol2.
            "pol2": PolicySpec(action_space=some_other_action_space, observation_space=some_other_obs_space),
        },
        policy_mapping_fn=lambda agent_id, episode, worker, **kwargs: "pol1" if agent_id == "agent1" else "pol2",
    )
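
And if you then want to run that config through Tune, a minimal sketch would be something like this (assuming your multi-agent env is registered under the name "my_ma_env"):

    from ray import air, tune

    tune.Tuner(
        "PPO",
        param_space=my_config.environment("my_ma_env").to_dict(),
        run_config=air.RunConfig(stop={"training_iteration": 10}),
    ).fit()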

Ah, I mixed up batch size and rollout fragment length, sorry! You can’t have different values of the latter, can you? I.e. have one policy perform gradient updates more frequently than the other? And yes, happy to discuss in office hours later!

Hey!
Thanks for the explanation. However, I feel like this question wasn't fully answered (or maybe I am just missing something). I don't understand if it is possible to use multiple Algorithms in one Tuner. You rightfully expressed your concerns regarding the training of heterogeneous Policies within one Algorithm, but how would you actually do it the right way in RLlib?

An example:
Let's say I have a multi-agent environment where 2 agents play a game against each other (say rock-paper-scissors). I understand how to apply different policies to each agent; however, I want to train each agent with a different policy AND a different trainer (e.g., DQN & PPO). Is it possible to pass multiple trainers to tune? Do I have to sub-class Algorithm like in two_trainer_workflow.py to create a trainable that enables the desired behavior?

How I would like to handle this:

"multiagent": {
               "policies": {
                   "ppo_policy": PolicySpec(config=ppoconfig),
                   "dqn_policy": PolicySpec(config=dqnconfig)
               },
               # Map to either ppopolicy or dqnpolicy behavior based on the agent's ID.
               "policy_mapping_fn": (
                   lambda aid, **kwargs: ["ppo_policy", "dqn_policy"][aid % 2]
               ),
               "policies_to_train": ["dqn_policy", "ppo_policy"],
           },

    # pseudocode incoming
    results = tune.Tuner(
        Algorithm(PPOTrainer, DQNTrainer),
        param_space=mixconfig,
        run_config=air.RunConfig(stop=stop),
    ).fit()