Repeated in action space

How severe does this issue affect your experience of using Ray?
Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Hi all.

First post, still learning. I have read a lot of documentation, code, forum posts and GitHub issues. Posting here is a last resort; I’m really hoping you can point me in the right direction.

I’m trying to train a single ‘leader’ policy that coordinates a variable number of dumb agents to achieve a goal. The agents only do what the ‘leader’ commands; they have no internal intelligence or logic of their own. Some agents share the same goal, but they do not collaborate: each agent reaches its goal individually. However, there is a set of constraints that the agents must not violate, some of which depend on the state of the other agents. The policy will need to look ahead a couple of steps to make sure this does not happen.

That’s why the action space contains a variable-length list of actions for each agent. The environment applies these sequentially, one per planned future step, checking that no constraints would be violated if they were executed as planned. The observation space contains the end result of applying the actions in sequence, as well as a list of the outcome of each action.

My env has action and observation spaces that look something like this:

        self.action_space = Dict({
            agent_id: Repeated(child_space=Dict({
                "param_1": Box(low=-18, high=+18, shape=(1,), dtype=np.int8),
                "param_2": Box(low=0, high=30, dtype=np.int16)
            }), max_len=10)
            for agent_id in self.agents
        })

        self.observation_space = Dict({
            agent_id: Dict({
                "obs_1": Box(low=-200, high=+200, shape=(2,), dtype=np.float16),
                "obs_2": Box(low=-18, high=+18, shape=(1,), dtype=np.int8),
                "execution_log": Repeated(Dict({
                    "obs_3": Box(low=-200, high=+200, shape=(2,), dtype=np.float16),
                    "obs_4": Box(low=-18, high=+18, shape=(1,), dtype=np.int8)
                }), max_len=300)
            })
            for agent_id in self.agents
        })

Some dict key names have been changed to protect the innocent.
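
To make the mechanics concrete, the env’s step() does roughly the following with each agent’s planned action list (the helper names below are illustrative stand-ins, not my real code):

    def step(self, action):
        # 'action' is the Dict above: one variable-length list of planned
        # sub-actions per agent_id.
        obs, reward = {}, 0.0
        for agent_id, planned_actions in action.items():
            execution_log = []
            for planned in planned_actions:
                # Abort this agent's plan as soon as a constraint (possibly
                # involving another agent's state) would be violated.
                if self._would_violate_constraints(agent_id, planned):
                    break
                self._apply(agent_id, planned)
                execution_log.append(self._outcome(agent_id))
            obs[agent_id] = self._observe(agent_id, execution_log)
            reward += self._reward(agent_id, execution_log)
        return obs, reward, self._is_done(), False, {}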

I’m using the APPO algorithm, something like this:

    tuner = tune.Tuner("APPO",
                       run_config=air.RunConfig(
                           stop={"training_iteration": 10},
                           verbose=AirVerbosity.DEFAULT,
                           progress_reporter=reporter,
                           storage_path="%s/build/ray_results" % CWD,
                           name="training_log",
                           checkpoint_config=air.CheckpointConfig(
                               checkpoint_frequency=10,
                               num_to_keep=10,
                               checkpoint_at_end=True)),
                       param_space=param_space)

    results_grid = tuner.fit()
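
For completeness, param_space is built from an APPOConfig, roughly along these lines (LeaderEnv is just a placeholder name for my custom env class):

    from ray.rllib.algorithms.appo import APPOConfig

    param_space = (
        APPOConfig()
        .environment(env=LeaderEnv)   # placeholder name for my custom env
        .framework("torch")
        .to_dict()
    )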

When I run this I get the following error output:

NotImplementedError: Unsupported args: Repeated(Dict('param_1': Box(0, 30, (1,), int16), 'param_2': Box(-18, 18, (1,), int8)), 10) None
Full stack trace:
NotImplementedError: Unsupported args: Repeated(Dict('param_2': Box(0, 30, (1,), int16), 'param_1': Box(-18, 18, (1,), int8)), 10) None
Exception raised in creation task: The actor died because of an error raised in its creation task, ray::RolloutWorker.__init__() (pid=12387, ip=192.168.111.128, actor_id=232a3a67c6b776f4c214377001000000, repr=<ray.rllib.evaluation.rollout_worker._modify_class.<locals>.Class object at 0x7f974042a0b0>)
  File "/home/adamcc/leader/venv/lib/python3.10/site-packages/ray/rllib/evaluation/rollout_worker.py", line 525, in __init__
    self._update_policy_map(policy_dict=self.policy_dict)
  File "/home/adamcc/leader/venv/lib/python3.10/site-packages/ray/rllib/evaluation/rollout_worker.py", line 1727, in _update_policy_map
    self._build_policy_map(
  File "/home/adamcc/leader/venv/lib/python3.10/site-packages/ray/rllib/evaluation/rollout_worker.py", line 1838, in _build_policy_map
    new_policy = create_policy_for_framework(
  File "/home/adamcc/leader/venv/lib/python3.10/site-packages/ray/rllib/utils/policy.py", line 142, in create_policy_for_framework
    return policy_class(observation_space, action_space, merged_config)
  File "/home/adamcc/leader/venv/lib/python3.10/site-packages/ray/rllib/algorithms/appo/appo_torch_policy.py", line 84, in __init__
    TorchPolicyV2.__init__(
  File "/home/adamcc/leader/venv/lib/python3.10/site-packages/ray/rllib/policy/torch_policy_v2.py", line 96, in __init__
    model, dist_class = self._init_model_and_dist_class()
  File "/home/adamcc/leader/venv/lib/python3.10/site-packages/ray/rllib/policy/torch_policy_v2.py", line 516, in _init_model_and_dist_class
    model = self.make_model()
  File "/home/adamcc/leader/venv/lib/python3.10/site-packages/ray/rllib/algorithms/appo/appo_torch_policy.py", line 109, in make_model
    return make_appo_models(self)
  File "/home/adamcc/leader/venv/lib/python3.10/site-packages/ray/rllib/algorithms/appo/utils.py", line 19, in make_appo_models
    _, logit_dim = ModelCatalog.get_action_dist(
  File "/home/adamcc/leader/venv/lib/python3.10/site-packages/ray/rllib/models/catalog.py", line 322, in get_action_dist
    return ModelCatalog._get_multi_action_distribution(
  File "/home/adamcc/leader/venv/lib/python3.10/site-packages/ray/rllib/models/catalog.py", line 923, in _get_multi_action_distribution
    child_dists_and_in_lens = tree.map_structure(
  File "/home/adamcc/leader/venv/lib/python3.10/site-packages/tree/__init__.py", line 435, in map_structure
    [func(*args) for args in zip(*map(flatten, structures))])
  File "/home/adamcc/leader/venv/lib/python3.10/site-packages/tree/__init__.py", line 435, in <listcomp>
    [func(*args) for args in zip(*map(flatten, structures))])
  File "/home/adamcc/leader/venv/lib/python3.10/site-packages/ray/rllib/models/catalog.py", line 924, in <lambda>
    lambda s: ModelCatalog.get_action_dist(s, config, framework=framework),
  File "/home/adamcc/leader/venv/lib/python3.10/site-packages/ray/rllib/models/catalog.py", line 350, in get_action_dist
    raise NotImplementedError(
NotImplementedError: Unsupported args: Repeated(Dict('param_1': Box(0, 30, (1,), int16), 'param_2': Box(-18, 18, (1,), int8)), 10) None

It seems that the model is not recognising the Repeated space: it isn’t handled anywhere in the get_action_dist() method of catalog.py referenced in the trace.
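
The error can be reproduced outside the Tuner by calling the catalog directly on a Repeated space. A minimal sketch (assuming gymnasium spaces and the old ModelCatalog API stack):

    import numpy as np
    from gymnasium.spaces import Box, Dict
    from ray.rllib.models import MODEL_DEFAULTS, ModelCatalog
    from ray.rllib.utils.spaces.repeated import Repeated

    space = Repeated(child_space=Dict({
        "param_1": Box(low=-18, high=+18, shape=(1,), dtype=np.int8),
        "param_2": Box(low=0, high=30, dtype=np.int16),
    }), max_len=10)

    # Falls through all the isinstance() checks in get_action_dist() and
    # raises the same "Unsupported args: Repeated(...) None" error.
    ModelCatalog.get_action_dist(space, config=MODEL_DEFAULTS, framework="torch")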

To get around this I replaced the Repeated space in the action_space with a simple Dict:

        self.action_space = Dict({
            agent_id: Dict({
                i: Dict({
                    "param_1": Box(low=-18, high=+18, shape=(1,), dtype=np.int8),
                    "param_2": Box(low=0, high=30, dtype=np.int16)
                })
                for i in range(0, 10)
            })
            for agent_id in self.agents
        })

This is obviously a fugly hack, and it’s clearly not right, because it always requires exactly 10 actions per agent. In reality the policy is rewarded for achieving the goal with fewer actions (highest reward for a single action per agent), so it should only generate as many actions as are needed to reach the goal.

Thanks in advance for any hints.

Yours,

Adam

I have improved the work-around somewhat by using Tuple instead of Dict for the actions.

        self.action_space = Dict({
            agent_id: Tuple(spaces=[Dict({
                "param_1": Box(low=-18, high=+18, shape=(1,), dtype=np.int8),
                "param_2": Box(low=0, high=30, dtype=np.int16)
            }) for i in range(0, 10)])
            for agent_id in self.agents
        })

However that doesn’t fix the underlying problem: the action really needs to be a variable-length list. The policy needs to find the smallest number of actions that gets each agent to its goal as directly as possible.

To be clear, if I use the Repeated space as described in my first post above, the action space’s sample() method returns a different, random number of actions for each agent_id, as expected.
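
For example, sampling a standalone copy of the Repeated-based action space gives a different list length per agent on each call (quick sketch, using stand-in agent ids):

    import numpy as np
    from gymnasium.spaces import Box, Dict
    from ray.rllib.utils.spaces.repeated import Repeated

    action_space = Dict({
        agent_id: Repeated(child_space=Dict({
            "param_1": Box(low=-18, high=+18, shape=(1,), dtype=np.int8),
            "param_2": Box(low=0, high=30, dtype=np.int16),
        }), max_len=10)
        for agent_id in ["agent_0", "agent_1"]   # stand-in agent ids
    })

    for agent_id, planned_actions in action_space.sample().items():
        # Each agent gets a variable number of planned actions (up to max_len).
        print(agent_id, len(planned_actions))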