Multi-agent: Where does the "first structure" come from?

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty completing my task, but I can work around it.

So I have a somewhat common problem, but I could not find a solution that works.

Here is the error:

"""
ValueError: The two structures don't have the same nested structure.

First structure: type=ndarray str=[...]

Second structure: type=OrderedDict str=OrderedDict([('x', array([...])),
('y', array([...]))])

More specifically: Substructure "type=OrderedDict str=OrderedDict([('x', array([...])),
('y', array([...]))])" is a sequence, while substructure "type=ndarray str=[...]" is not
Entire first structure:
.
Entire second structure:
OrderedDict([('x', .), ('y', .)])
"""

The observation space is defined as follows:

obs = spaces.Box(low=-10_000, high=200_000, shape=(self.n_states,), dtype=np.int32)
obs.seed(0)
self.observation_space = spaces.Dict({
    agent_names[0]: obs,  # spaces.Box(low=-10_000, high=200_000, shape=(self.n_states,), dtype=np.int32),
    agent_names[1]: obs,  # spaces.Box(low=-10_000, high=200_000, shape=(self.n_states,), dtype=np.int32)
})

I tried both variants, the predefined shared space and defining the Box inline per key (see the comments above)… always the same error.

Here is the state:

def _get_state(self, return_len=False):
    curr_state = self._get_one_state()
    dict_state = {  # OrderedDict(
        agent_names[0]: curr_state,
        agent_names[1]: curr_state,
    }  # )

    if return_len:
        return dict_state, len(curr_state)
    else:
        return dict_state

Where self._get_one_state() gives me the needed array. I tried with OrderedDict too; it did not work out…

I can see that the “first structure” is an np.array, but I do not understand where it comes from… I also got the single-agent environment working easily, but multi-agent might help with scaling in the near future. So there is a workaround, but it would be nice to know where I make the mistake.

P.S. If needed - I can share the whole environment :slight_smile:

One more note that might be relevant: I used the same code for the trainer as for the single PPO agent, i.e. I did not change anything in the config or anywhere else.

Hi @vlainic,

It is hard to tell what the issue might be without seeing the entire environment. My best guess from this is that you are returning a dictionary inside of a dictionary?

Hi @mannyv,
Thanks for the answer.
I do not think that happens in my environment. However, I am sharing the full code anyway for better clarity:

Hi @vlainic,

Yes, I see what you mean. My guess was wrong; it is actually the opposite.

The issue arises from your definitions of the obs and action spaces. RLlib expects those spaces to be defined for a single agent, not the full multi-agent version.

Based on the space you defined, it is expecting the obs you return from reset and step to be a dictionary observation with two keys. You may think you are providing this, but the difference is that it is expecting this structure for each agent.

Something like:

{
    "agent_a": {
        "agent_a": ...,
        "agent_b": ...,
    },
    "agent_b": {
        "agent_a": ...,
        "agent_b": ...,
    },
}
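
To make that concrete, here is a purely illustrative comparison (keys taken from your error message, shapes made up) of the two structures that end up being compared for each agent:

from collections import OrderedDict
import numpy as np

# What the Dict observation space expects for a *single* agent:
expected_per_agent = OrderedDict(
    [("x", np.zeros(3, dtype=np.int32)), ("y", np.zeros(3, dtype=np.int32))]
)
# What your _get_state() currently returns per agent: one flat array.
returned_per_agent = np.zeros(3, dtype=np.int32)
# Comparing these two nested structures is what raises
# "ValueError: The two structures don't have the same nested structure."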

You have two options:

The first is to continue providing the full observation like you are, but to make a copy of it for every agent in the environment (sketched below). This approach will require a custom model that extracts only the key you want, based on the agent the model is applied to. It is a lot of extra data, but if you go the route of a centralized critic you may need this info anyway.
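
A rough sketch of what I mean, reusing the _get_one_state and agent_names from your env (hypothetical and untested):

def _get_state(self):
    curr_state = self._get_one_state()
    # The full observation: one entry per agent, as before.
    full_obs = {name: curr_state for name in agent_names}
    # Option 1: every agent receives its own copy of that *full* dict, so the
    # per-agent observation matches the Dict observation_space from __init__.
    return {name: dict(full_obs) for name in agent_names}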

The second approach, and the one I would take, would be to include multiple observation (and action) space definitions in the environment and, in your multi-agent config, create a policy specific to each agent type.

env = make_temp_env(env_config)
config["multiagent"] = {
    "policies": {
        "policy_a": PolicySpec(None, env.obs_spaces["agent_type_a"], env.act_spaces["agent_type_a"], {}),
        "policy_b": PolicySpec(None, env.obs_spaces["agent_type_b"], env.act_spaces["agent_type_b"], {}),
    }
}
env.close()

Hope this helps.



Hey @vlainic, Your code is unfortunately not runnable, so I cannot fully reproduce the error you are seeing. But my guess is that you need to specify self._spaces_in_preferred_format = True during env construction. This is because you have a heterogeneous multi-agent case where the obs or action spaces of all agents are not identical. Look at rllib/examples/multi_agent_difference_spaces_for_agent.py as an example of how it’s done properly.
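
For reference, a minimal sketch of that flag (hypothetical env with made-up spaces; the example file shows the full version):

import gym
from ray.rllib.env.multi_agent_env import MultiAgentEnv

class MyEnv(MultiAgentEnv):
    def __init__(self, config=None):
        self._agent_ids = {"x", "y"}
        # Tell RLlib the Dict spaces below are already keyed by agent ID
        # (the "preferred" multi-agent format), not a single agent's space.
        self._spaces_in_preferred_format = True
        self.observation_space = gym.spaces.Dict({
            "x": gym.spaces.Box(low=-1.0, high=1.0, shape=(4,)),
            "y": gym.spaces.Box(low=-1.0, high=1.0, shape=(4,)),
        })
        self.action_space = gym.spaces.Dict({
            "x": gym.spaces.Discrete(2),
            "y": gym.spaces.Discrete(2),
        })
        super().__init__()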


Hi @mannyv ,
I tried the second approach first, and it worked like a charm. To be more specific, I added the following code to the main configuration dictionary:

"multiagent": {
    "policies": {
        'x': PolicySpec(None, env.observation_space['x'], env.action_space['x'], {}),
        'y': PolicySpec(None, env.observation_space['y'], env.action_space['y'], {}),
    },
    "policy_mapping_fn": lambda aid, episode, worker, **kw: "x" if aid.startswith("x") else "y",
},

Of course, I initialized the env before the config setup and closed it afterwards.

The policy_mapping_fn is from the file that @kourosh proposed.

Thank you so much guys :slight_smile:

P.S. This confirms my additional comment from the top:
“One more note that might be relevant: I used the same code for the trainer as for the single PPO agent, i.e. I did not change anything in the config or anywhere else.”
I assumed those things were not needed, since it is kind of a 1-to-1 mapping and the model would figure it out automatically, i.e. I initially expected this to be the default behavior.

Hi @kourosh ,
As I told you yesterday, I tried to replicate the issue using the example you proposed. In my main file I use tune.run, not the tune.Tuner from the example, so I adapted that part, i.e. tried to make the training setup similar to my main file. And I still got the error from this topic, even when including the multiagent section inside the config:

Can you please check to see what is wrong here?
Cheers,
Milos

Hey @vlainic, Tuner is just the AIR interface to tune, so there is not really a difference there. I took your notebook and migrated it to the latest master changes, and it doesn’t hit the error you are seeing. So I wonder if you could upgrade your Ray to the latest release and try again? You can use either 2.0rc0 or 1.13. Here is the code I tried; you can compare it to your notebook to figure out the migration details:

import gym

import ray
from ray import tune  # air,
from ray.rllib.env.multi_agent_env import MultiAgentEnv

from ray.rllib.algorithms.ppo import PPO, PPOConfig

class BasicMultiAgentMultiSpaces(MultiAgentEnv):
    """A simple multi-agent example environment where agents have different spaces.
    agent0: obs=(10,), act=Discrete(2)
    agent1: obs=(20,), act=Discrete(3)
    The logic of the env doesn't really matter for this example. The point of this env
    is to show how one can use multi-agent envs, in which the different agents utilize
    different obs- and action spaces.
    """

    def __init__(self, config=None):
        self.agents = {"agent0", "agent1"}
        self._agent_ids = set(self.agents)

        self.dones = set()

        # Provide full (preferred format) observation- and action-spaces as Dicts
        # mapping agent IDs to the individual agents' spaces.
        self._spaces_in_preferred_format = True
        self.observation_space = gym.spaces.Dict(
            {
                "agent0": gym.spaces.Box(low=-1.0, high=1.0, shape=(10,)),
                "agent1": gym.spaces.Box(low=-1.0, high=1.0, shape=(20,)),
            }
        )
        self.action_space = gym.spaces.Dict(
            {"agent0": gym.spaces.Discrete(2), "agent1": gym.spaces.Discrete(3)}
        )

        super().__init__()

    def reset(self):
        self.dones = set()
        return {i: self.observation_space[i].sample() for i in self.agents}

    def step(self, action_dict):
        obs, rew, done, info = {}, {}, {}, {}
        for i, action in action_dict.items():
            obs[i] = self.observation_space[i].sample()
            rew[i] = 1.0
            done[i] = False
            info[i] = {}
        done["__all__"] = len(self.dones) == len(self.agents)
        return obs, rew, done, info


agent_config = (
    PPOConfig()
    .environment(env=BasicMultiAgentMultiSpaces)
    .resources(num_gpus=0)
    .rollouts(num_rollout_workers=1)
    .multi_agent(
        policies={"main0", "main1"},
        policy_mapping_fn = (
            lambda aid, episode, worker, **kw: f"main{aid[-1]}"
        ),
        policies_to_train=["main0"]
    )
    .framework("torch")
)


ray.init(include_dashboard=False, ignore_reinit_error=True)


analysis = tune.run(
    PPO,
    stop={
        "training_iteration": 10,
        "timesteps_total": 10_000,
        "episode_reward_mean": 80.0,
    },
    config=agent_config.to_dict(),
    # # Milos
    # verbose = 0,    
    # fail_fast = "raise", # for debugging!
)
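
And for completeness, a rough Tuner equivalent of the tune.run() call above (just a sketch, assuming ray >= 2.0; I have not run this version against your notebook):

from ray import air, tune

# Same trainable, config, and stopping criteria, expressed via the Tuner API.
tuner = tune.Tuner(
    PPO,
    param_space=agent_config.to_dict(),
    run_config=air.RunConfig(
        stop={
            "training_iteration": 10,
            "timesteps_total": 10_000,
            "episode_reward_mean": 80.0,
        },
    ),
)
results = tuner.fit()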

Hey @kourosh ,

Sorry for the delayed reply, I had some things to finish since my boss came back from vacation xD

I tried your advice: I got the same error with 1.13 (running my own code), while 2.0rc0 had no errors even with my notebook…

I will return to 1.13 now and continue where I left off. I guess this issue is addressed in version 2 :slight_smile: