Multi-Agent Training with Different Algorithms

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity

(Severity is both "none" and "high": at the moment I’m just curious, but depending on the answer this might completely block me from doing something I wanted to do with RLlib.)

I have a quick question on the two-algorithm multiagent example ray/multi_agent_two_trainers.py at master · ray-project/ray · GitHub

I see that this creates separate DQN and PPO trainers, and then in line 154 it calls result_dqn = dqn.train() and in line 159 it calls result_ppo = ppo.train(). If I understand the Trainer class correctly, each Trainer object creates its own environment(s), and Trainer.train() collects experiences (in addition to learning from them), is that right?

So is my understanding of the entire example correct that
(1) each trainer actually has its own copy of the environment,
(2) each trainer separately collects a different set of experience from its own private environment, and only this trainer trains on those experiences,
(3) there are actually duplicate copies of each neural network: in the DQN env there are DQN agents (which are learning) as well as PPO agents (which are not learning, and are only queried for actions to generate experiences for the DQN trainer), and similarly in the PPO env there are learning PPO agents and non-learning DQN agents, and
(4) that’s why in lines 172-173 we need to synchronise weights, so that the non-learning agents in each env are reasonably up to date?
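In case it helps to pin down what I mean, here is a toy mock of points (1)-(4) in plain Python (none of these names are real RLlib APIs; the integers just stand in for network weights):

```python
# Toy mock: two "trainers", each holding its own private copy of BOTH
# policies' weights. Each trainer only updates the policy it learns; its
# copy of the other policy goes stale unless weights are synced.

class MockTrainer:
    def __init__(self, learns):
        # Each trainer holds private copies of both policies' weights.
        self.weights = {"dqn_policy": 0, "ppo_policy": 0}
        self.learns = learns  # the one policy this trainer updates

    def train(self):
        # Only the learning policy changes; the other is merely queried
        # for actions and stays frozen.
        self.weights[self.learns] += 1

    def get_weights(self, policy_id):
        return self.weights[policy_id]

    def set_weights(self, policy_id, w):
        self.weights[policy_id] = w

dqn = MockTrainer(learns="dqn_policy")
ppo = MockTrainer(learns="ppo_policy")

for _ in range(3):
    dqn.train()
    ppo.train()
    # Without this sync, each trainer's copy of the *other* policy would
    # still be at its initial weights (point (4) above).
    dqn.set_weights("ppo_policy", ppo.get_weights("ppo_policy"))
    ppo.set_weights("dqn_policy", dqn.get_weights("dqn_policy"))

print(dqn.weights, ppo.weights)
```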


Hi Matthias!

Would you like to ask your question in RLlib Office Hours? Just add your question to this doc: RLlib Office Hours - Google Docs

Thanks! Hope to see you there!

I implemented a multi-trainer setup using the client-server framework that works a little differently from the RLlib example. In my implementation, the trainers don’t alternate training: they both train on the same environment. Basically I have a single client script and two server scripts. My client script has two clients which connect to the servers on different ports. I generate actions from the respective clients, and that allows me (theoretically) to train with two algorithms on the same simulation instance.

I’m still working out some of the bugs, but I can share some pseudo-code if you’re interested in trying it out yourself.

Hi Edward, that would be super interesting! Are you basically using external environments? Or something else? Any code, even pseudocode, you could share would be much appreciated!

I started with the Cartpole Client-Server example. I’m doing a multi-agent game where I want one algorithm controlling some agents and another algorithm controlling others. These agents interact in a common game (think agent in the agent-based-simulation sense, not agent = RL algorithm).

The client:

    # Module path for Ray versions with the old agents API.
    from ray.rllib.env.policy_client import PolicyClient

    # Make two client objects, one for each server. Make sure they use different ports.
    client_1 = PolicyClient('http://localhost:9900', inference_mode="local")
    client_2 = PolicyClient('http://localhost:9910', inference_mode="local")

    # Import your sim. I am using custom MultiAgentEnv simulations that I made myself,
    # but this should work with any RLlib environment that can work with client-server.
    import sim # Pseudocode ;)

    for _ in range(num_episodes):
        # Start data generation
        obs = sim.reset()
        # Each client has to start its own episode
        eid_1 = client_1.start_episode(training_enabled=True)
        eid_2 = client_2.start_episode(training_enabled=True)
        for _ in range(max_steps_per_episode):
            # Combine actions from the two servers. Notice that I only give client 1
            # the observations that are associated with that server, so that it only
            # reports actions for those agents. This is important because everything
            # has to sync up correctly. (obs[agents_with_client_1] is pseudocode
            # shorthand for filtering the obs dict down to client 1's agent ids.)
            action = {
                **client_1.get_action(eid_1, obs[agents_with_client_1]),
                **client_2.get_action(eid_2, obs[agents_with_client_2]),
            }
            # The actions are passed to the simulation, so all the agents interact
            # in each step.
            obs, reward, done, info = sim.step(action)
            client_1.log_returns(eid_1, reward[agents_with_client_1])
            client_2.log_returns(eid_2, reward[agents_with_client_2])
            if done['__all__']:
                break
        client_1.end_episode(eid_1, obs[agents_with_client_1])
        client_2.end_episode(eid_2, obs[agents_with_client_2])

Server 1:

    # Server 1 will train with PPO
    from ray.rllib.agents import ppo
    from ray.tune.logger import pretty_print

    def _input(ioctx):
        if ioctx.worker_index > 0 or ioctx.worker.num_workers == 0:
            # PolicyServerInputMA is my own multi-agent variant of PolicyServerInput.
            return PolicyServerInputMA(
                ioctx,
                "localhost",
                9900,
                idle_timeout=3.0
            )
        else:
            # No InputReader (PolicyServerInput) needed.
            return None

    # Here I have this algorithm training multiple policies.
    # (observation_space and action_space are whatever your sim defines.)
    policies = {
        'policy_1': (None, observation_space, action_space, {}),
        'policy_2': (None, observation_space, action_space, {})
    }

    def policy_mapping_fn(agent_id):
        ... # This is straightforward: map agent ids to policy ids as you would for any multi-agent game

    config = {
        # Use the connector server to generate experiences.
        "input": _input,
        "env": None,
        "multiagent": {
            "policies": policies,
            "policy_mapping_fn": policy_mapping_fn,
            "count_steps_by": "agent_steps"
        },
        "num_workers": 0,
    }

    trainer = ppo.PPOTrainer(config=config)

    # Serving and training loop.
    ts = 0
    for _ in range(args.stop_iters):
        results = trainer.train()
        print(pretty_print(results))
        if results["episode_reward_mean"] >= args.stop_reward or ts >= args.stop_timesteps:
            break
        ts = results["timesteps_total"]  # timesteps_total is already cumulative
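For what it’s worth, the elided policy_mapping_fn is just the usual multi-agent mapping. A hypothetical example, assuming this server’s agents have ids like "predator_0", "predator_1", … and "prey_0", …:

```python
# Hypothetical agent-id convention: "predator_*" agents use policy_1,
# "prey_*" agents use policy_2.
def policy_mapping_fn(agent_id):
    return "policy_1" if agent_id.startswith("predator") else "policy_2"

print(policy_mapping_fn("predator_0"))  # -> policy_1
print(policy_mapping_fn("prey_3"))      # -> policy_2
```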

Server 2:

    # Server 2 will train with A2C
    from ray.rllib.agents import a2c
    from ray.tune.logger import pretty_print

    def _input(ioctx):
        if ioctx.worker_index > 0 or ioctx.worker.num_workers == 0:
            return PolicyServerInputMA(
                ioctx,
                "localhost",
                9910,  # note: a different port than server 1
                idle_timeout=3.0
            )
        else:
            # No InputReader (PolicyServerInput) needed.
            return None

    # This algorithm just trains a single policy, but I'm going to use the
    # multi-agent setup so that the expected inputs and outputs follow the same
    # format (dict with agent ids as keys and obs/actions as values).
    policies = {
        'policy_1': (None, observation_space, action_space, {}),
    }

    def policy_mapping_fn(agent_id):
        return "policy_1"

    config = {
        # Use the connector server to generate experiences.
        "input": _input,
        "env": None,
        "multiagent": {
            "policies": policies,
            "policy_mapping_fn": policy_mapping_fn,
            "count_steps_by": "agent_steps"
        },
        "num_workers": 0,
    }

    trainer = a2c.A2CTrainer(config=config)

    # Serving and training loop.
    ts = 0
    for _ in range(args.stop_iters):
        results = trainer.train()
        print(pretty_print(results))
        if results["episode_reward_mean"] >= args.stop_reward or ts >= args.stop_timesteps:
            break
        ts = results["timesteps_total"]  # timesteps_total is already cumulative

Then just run the client and server scripts on the same machine via different processes.
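Concretely, something like this (hypothetical filenames; run each in its own terminal or as a background process):

```shell
# Hypothetical filenames -- three independent processes on one machine.
python server_ppo.py &    # PPO server, listening on port 9900
python server_a2c.py &    # A2C server, listening on port 9910
python client.py          # runs the sim and queries both servers
```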

Like I said, I’m still debugging it, but hopefully it works. Please let me know if you have questions or if you get it to work for you!

@rusu24edward

I may have misunderstood, but if the agents are always using a fixed algorithm, you could specify it similar to this and not need multiple servers or clients:

    config["multiagent"]["policies"] = {
        "a": ("A2C", ...),
        "b": ("PPO", ...)
    }

Maybe I’m misunderstanding now, but don’t you have to pass a Trainer object to tune.run()? I.e. either an A2CTrainer or a PPOTrainer? What would you pass to tune in your example?

@mgerstgrasser,

Yes you have to pass either a trainer or the string it has been registered under.

Normally the first tuple value in the multiagent policies dictionary is None, in which case each policy uses the default class of the Trainer you pass in. If you provide a policy class instead of None, that overrides the default you passed to tune.
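For example (a sketch, not run here: module paths vary by Ray version, obs_space/act_space stand in for your spaces, and A2C reuses the A3C policy class in the old agents API):

```python
# Sketch: per-policy class override in the old "multiagent" config.
# A None first element falls back to the Trainer's default policy class.
from ray.rllib.agents.a3c.a3c_tf_policy import A3CTFPolicy
from ray.rllib.agents.ppo.ppo_tf_policy import PPOTFPolicy

config["multiagent"]["policies"] = {
    "a": (A3CTFPolicy, obs_space, act_space, {}),  # A2C/A3C-style policy
    "b": (PPOTFPolicy, obs_space, act_space, {}),  # PPO-style policy
    "c": (None, obs_space, act_space, {}),         # uses the Trainer default
}
```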

I did not know you could put them into tune that way. I’ll give it a try. Just to confirm, this will allow some entities in the game to be controlled by one algorithm and other entities to be controlled by another, and for that to happen within the same environment?

I also found this example, which is quite a bit more complex but claims to accomplish what’s desired here: ray/two_trainer_workflow.py at master · ray-project/ray · GitHub

Yes that is correct.

Just to triple check, this works for learning, not just deploying already-trained policies? If yes, that’s fantastic - much simpler than the example I linked in my original post. Is this documented anywhere?

Yes that is correct. Here is an example that uses it as I described. Here it only trains one of them but it could have trained both. Try setting both of them to PG.

That’s awesome! Glad that this functionality already exists in RLlib